cautioushedonist t1_iwffey3 wrote on November 15, 2022 at 5:49 AM

#533,367

It sounds like, out of all the candidate words that are closest to the misspelling by Levenshtein distance, you need some way to pick the most "right" one.

For example -

Original : Mary had a lill lamb

Soundex : Mary had a (little/lit/let/list) lamb

The words in brackets are the ones that come out as close matches under Levenshtein distance, and you want to find out which one is the most right, right?

Solution -

I would leverage any large language model by "masking" the misspelling and letting the model predict what word should go there. Each of the model's predictions comes with a confidence score, which lets you build a ranked list of the predictions.
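A minimal sketch of the masking step, assuming the Hugging Face `transformers` fill-mask pipeline and the `distilbert-base-uncased` model (the model choice is an assumption, not a recommendation). The pipeline's `targets` argument limits scoring to a given candidate list. The helper names `rank_with_fill_mask` and `build_unmasker` are made up for illustration. Candidates that are not single tokens in the model's vocabulary may be scored by their first sub-token only, so treat the scores as rough.

```python
from typing import Callable, List, Tuple


def rank_with_fill_mask(
    unmasker: Callable, masked_sentence: str, candidates: List[str]
) -> List[Tuple[str, float]]:
    """Score only the given candidates for the masked slot, best first."""
    # The fill-mask pipeline returns one dict per target, with
    # "token_str" (the word) and "score" (the model's probability).
    predictions = unmasker(masked_sentence, targets=candidates)
    scored = [(p["token_str"].strip(), float(p["score"])) for p in predictions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def build_unmasker(model_name: str = "distilbert-base-uncased"):
    """Create a fill-mask pipeline (hypothetical helper; downloads the model)."""
    # Imported lazily so the ranking helper has no hard dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)


# Usage (requires `pip install transformers torch`):
# unmasker = build_unmasker()
# print(rank_with_fill_mask(unmasker, "Mary had a [MASK] lamb.",
#                           ["little", "lit", "let", "list"]))
```

The mask token is `[MASK]` for BERT-style models; other model families use a different token, so check the model card before swapping models.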

So, once you have the LLM's ranked list and Soundex's ranked list, you should be able to come up with a heuristic to find the most right word based on their ranks in each list.
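One possible heuristic, sketched under the assumption that both lists are ordered best first: add up each word's position in the two lists and pick the lowest total (a simple rank-sum, Borda-style). The function name `combine_ranks` and the example orderings are made up for illustration; a weighted sum or reciprocal-rank scoring would work just as well.

```python
from typing import List


def combine_ranks(lm_ranked: List[str], phonetic_ranked: List[str]) -> List[str]:
    """Merge two best-first lists by summing each word's rank in both."""

    def rank(word: str, ranked: List[str]) -> int:
        # A word missing from a list gets the worst possible rank there.
        return ranked.index(word) if word in ranked else len(ranked)

    candidates = set(lm_ranked) | set(phonetic_ranked)
    # Lower total rank is better; ties are broken alphabetically.
    return sorted(
        candidates,
        key=lambda w: (rank(w, lm_ranked) + rank(w, phonetic_ranked), w),
    )


# Hypothetical orderings for "Mary had a lill lamb".
lm_list = ["little", "list", "let", "lit"]
phonetic_list = ["lit", "little", "let", "list"]
print(combine_ranks(lm_list, phonetic_list))
```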

You should be able to easily get started on the LLM part based on the example here - https://huggingface.co/tasks/fill-mask

Lmk if you have any questions!

dunkyboy05 t1_iwfgre6 wrote on November 15, 2022 at 6:04 AM

#533,445

Replying to cautioushedonist (#533,367)

That's brilliant

Devinco001 OP t1_iwgaljf wrote on November 15, 2022 at 12:45 PM

#534,899

Replying to cautioushedonist (#533,367)

Yes, for example if I have a word 'baend' and I make it go through soundex + levenshtein, it gives me 'band' and 'bend', both with a distance of 1. So I want to basically decide which of the words would be a better choice.
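To make the tie concrete, here is a small pure-Python Levenshtein distance (the standard dynamic-programming version, not the code of any particular library). Both candidates come out at distance 1 from 'baend', so edit distance alone cannot separate them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


for candidate in ("band", "bend"):
    print(candidate, levenshtein("baend", candidate))
```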

Yes, the LM idea is awesome. But I am a bit low on memory and disk space. On Hugging Face, the LM that pops up for fill-mask is quite large and takes significant computation time.

Can this be done without an LM, e.g. with frequency tables? Or is there an LM sort of thing where I can input the highest ranked Soundex words and get a confidence score for each? Or is there an optimized LM for this task? I tried finding one but haven't found any so far.
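A frequency-table tie-breaker could look roughly like this: count how often each word appears in some reference text and prefer the more frequent candidate. The toy corpus and the function names are made up for illustration; in practice the counts would come from a large word-frequency list. Note that this ignores the surrounding sentence, so it always picks the same word for a given tie.

```python
from collections import Counter
from typing import Iterable


def build_frequency_table(text: str) -> Counter:
    """Count lowercase whitespace-separated words in a reference text."""
    return Counter(text.lower().split())


def pick_by_frequency(candidates: Iterable[str], frequencies: Counter) -> str:
    """Return the candidate seen most often; unseen words count as zero."""
    return max(candidates, key=lambda word: frequencies[word])


# Toy corpus, purely illustrative.
corpus = "the band played while the band marched around the bend"
table = build_frequency_table(corpus)
print(pick_by_frequency(["band", "bend"], table))
```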

cautioushedonist t1_iwgzytj wrote on November 15, 2022 at 4:02 PM

#536,646

Replying to Devinco001 (#534,899)

Can you confirm if the example on this webpage doesn't work for you in terms of size and memory?

The example is under "How to use" on

https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D

Now, if you're open to paying for GPT-3 services, then this answer might be helpful.

https://stackoverflow.com/questions/73370817/how-to-use-gpt-3-for-fill-mask-tasks

These would be API calls, so you wouldn't need to worry about inference time or model size.

So, you can either find the smallest LM possible that can work with fill-mask or use some API service to get around size/memory bottlenecks.

Devinco001 OP t1_iwmaat8 wrote on November 16, 2022 at 6:16 PM

#546,878

Replying to cautioushedonist (#536,646)

I actually saw this very example first. Yeah, it requires a good amount of computational power, which my PC currently lacks. I could use API calls, but there would be rate limits, which need to be payed to extend usage. That is why I have to drop that approach.

I was actually looking for a non-language-model approach for now, since language models are computation heavy. For now I am going to use the symspell Python library, since it is faster, though less accurate. Once I increase my RAM, I will surely start using an LM, since these are far more accurate. Thanks

Paid-Not-Payed-Bot t1_iwmabtx wrote on November 16, 2022 at 6:16 PM

#546,881

Replying to Devinco001 (#546,878)

> to be paid to extend

FTFY.

Although payed exists (the reason why autocorrection didn't help you), it is only correct in:

Nautical context, when it means to paint a surface, or to cover with something like tar or resin in order to make it waterproof or corrosion-resistant. The deck is yet to be payed.
Payed out when letting strings, cables or ropes out, by slacking them. The rope is payed out! You can pull now.

Unfortunately, I was unable to find nautical or rope-related words in your comment.

Beep, boop, I'm a bot

[D] Phonetic Algorithm Spellcheck Metric
