It sounds like, out of all the candidate words that are closest to the misspelling by Levenshtein distance, you need some way to find out which one is the most "right".
For example -
Original : Mary had a lill lamb
Soundex : Mary had a (little/lit/let/list) lamb
The words in brackets are the candidates closest to the misspelling by Levenshtein distance, and you want to find out which one is the most right, right?
Solution -
I would leverage any large language model by "masking" the misspelling and letting the model predict what word should go there. Each of the model's predictions comes with a confidence score, which lets you build a ranked list of its predictions.
So, once you have the LLM's ranked list and Soundex's ranked list, you should be able to come up with a heuristic that picks the most right word based on its rank in each list.
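One possible heuristic for combining the two lists is reciprocal rank fusion: each word gets 1 / (k + rank) from every list it appears in, and the word with the highest total wins. This is only a sketch of one option, not the only way to do it. The function name `fuse_rankings`, the constant `k=60`, and both example lists are assumptions made for illustration (the orderings are made up, not real model output).

```python
def fuse_rankings(rankings, k=60):
    """Combine several ranked lists with reciprocal rank fusion.

    Each list is ordered best-first. A word earns 1 / (k + rank) per list,
    with rank starting at 1; words missing from a list earn nothing from it.
    """
    scores = {}
    for ranking in rankings:
        for rank, word in enumerate(ranking, start=1):
            scores[word] = scores.get(word, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical rankings for "Mary had a lill lamb" (illustrative only).
lm_ranked = ["little", "small", "let"]
edit_ranked = ["lit", "little", "let", "list"]

fused = fuse_rankings([lm_ranked, edit_ranked])
print(fused[0])
```

Words that both lists agree on float to the top, while a word that only one list likes cannot win on its own unless the other list has nothing better.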
Yes. For example, if I run the word 'baend' through Soundex + Levenshtein, I get 'band' and 'bend', both at a distance of 1. So I basically want to decide which of the two would be the better choice.
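To make the tie concrete, here is a small sketch of the standard dynamic-programming Levenshtein distance. The function name `levenshtein` is just an illustrative choice and this is not tied to any particular library.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


for candidate in ("band", "bend"):
    print(candidate, levenshtein("baend", candidate))
```

Both candidates are one deletion away from 'baend', so edit distance alone cannot separate them and some extra signal (context or word frequency) is needed.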
Yes, the LM idea is awesome. But I am a bit low on memory and disk space. On Hugging Face, the LM that comes up for fill-mask is quite large and takes significant compute time.
Can this be done without an LM, for example with frequency tables? Or is there an LM-like tool where I can input the highest-ranked Soundex words and get a confidence score for each? Or is there an LM optimized for this task? I tried to find one but haven't found any so far.
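A frequency-table version could look roughly like the sketch below: count unigrams and bigrams from some corpus, then prefer the candidate that most often follows the previous word, falling back to plain word frequency. The helper names `build_counts` and `pick_candidate` and the tiny `toy_corpus` are made up for illustration; a real table would come from a much larger corpus.

```python
from collections import Counter


def build_counts(corpus_text):
    """Return (unigram, bigram) counts from whitespace-separated text."""
    tokens = corpus_text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pick_candidate(candidates, prev_word, unigrams, bigrams):
    """Prefer the candidate seen most often after prev_word.

    Ties on the bigram count fall back to the unigram count; if both tie,
    the earlier candidate in the list is kept.
    """
    def key(word):
        return (bigrams[(prev_word.lower(), word)], unigrams[word])
    return max(candidates, key=key)


# Toy corpus, purely illustrative (not real frequency data).
toy_corpus = (
    "the band played all night . the band went home . "
    "bend the wire . the road has a bend"
)
unigrams, bigrams = build_counts(toy_corpus)

print(pick_candidate(["band", "bend"], "the", unigrams, bigrams))
print(pick_candidate(["band", "bend"], "a", unigrams, bigrams))
```

This needs only a couple of dictionaries in memory, at the cost of seeing just one word of context.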
I actually saw this very example first. Yeah, it requires a good amount of computational power, which my PC currently lacks. I could use API calls, but those have rate limits, which need to be payed for to extend usage. That is why I have to drop that approach.
I was actually looking for a non-language-model approach for now, since language models are computation heavy. I am currently going to use the symspell Python library, since it is faster, though less accurate. Once I increase my RAM, I will surely start using an LM, since these are far better in accuracy. Thanks!
Although "payed" exists (the reason why autocorrection didn't help you), it is only correct in:
Nautical context, when it means to paint a surface, or to cover with something like tar or resin in order to make it waterproof or corrosion-resistant. Example: "The deck is yet to be payed."
"Payed out", when letting strings, cables or ropes out by slacking them. Example: "The rope is payed out! You can pull now."
Unfortunately, I was unable to find nautical or rope-related words in your comment.
cautioushedonist t1_iwffey3 wrote
You should be able to easily get started on the LLM part based on the example here - https://huggingface.co/tasks/fill-mask
Lmk if you have any questions!
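If you only want scores for a handful of candidates rather than the model's own top guesses, the transformers fill-mask pipeline accepts a `targets` argument. Below is a rough sketch under a few assumptions: `mask_word` and `score_candidates` are hypothetical helper names, `distilbert-base-uncased` is just one example of a distilled model that is smaller than full-size BERT, and the transformers library (plus a backend such as PyTorch) has to be installed before `score_candidates` is called.

```python
def mask_word(sentence, misspelled, mask_token="[MASK]"):
    """Replace the first exact occurrence of the misspelled word with the mask."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word == misspelled:
            words[i] = mask_token
            break
    return " ".join(words)


def score_candidates(sentence, misspelled, candidates,
                     model_name="distilbert-base-uncased"):
    """Return {candidate: score} from a masked LM, restricted to candidates.

    Assumes each candidate is a single token in the model's vocabulary;
    as far as I know, the pipeline otherwise falls back to the first
    sub-token and warns about it.
    """
    # Imported lazily so the rest of the file works without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    masked = mask_word(sentence, misspelled, fill.tokenizer.mask_token)
    results = fill(masked, targets=candidates, top_k=len(candidates))
    return {r["token_str"]: r["score"] for r in results}


print(mask_word("the baend played all night", "baend"))
# Example call (downloads the model on first use):
# score_candidates("the baend played all night", "baend", ["band", "bend"])
```

Since the model only scores the words you pass in, the output can be plugged straight into whatever ranking heuristic you use for the edit-distance candidates.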