Submitted by Devinco001 t3_ywjd26 in MachineLearning
[removed]
Yes, I have done that. Even after dropping the duplicates, the count is still 10M
Without seeing the code it will be impossible to help here
Can you run stuff in parallel? Are you using the cloud? Run concurrent processes somehow, or if you already are, go even wider. There's no reason you can't divide this into 10 separate 1-million-word batches running at the same time on different machines. Not sure if you are doing that yet or not
First thing that popped into my mind was using a tree structure. A quick Google search ("suffix tree spellcheck") led to the helpful article below that proposes the use of BK-trees (a tree that essentially sorts child nodes by Levenshtein edit distance). Crucially, when doing a lookup, "Time Complexity will be O(L1\*L2\*log n), here L1 is the average length of word in our dictionary and L2 is the length of misspelled word."
https://www.geeksforgeeks.org/bk-tree-introduction-implementation/amp/
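A minimal sketch of the idea in the linked article (not the article's own code): store words in a tree keyed by edit distance, then use the triangle inequality to prune whole subtrees during lookup.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, {distance: child_node})
        for w in it:
            self.add(w)

    def add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return  # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def query(self, word, tol):
        # Return all (distance, word) pairs within edit distance tol.
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= tol:
                results.append((d, node))
            # Triangle inequality: only children whose edge distance
            # lies in [d - tol, d + tol] can contain matches.
            for k, child in children.items():
                if d - tol <= k <= d + tol:
                    stack.append(child)
        return sorted(results)
```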
Thanks! The BK-tree is much faster. While researching BK-trees, I found the SymSpell algorithm, which is even faster. So I am going to use that for now, because of its high accuracy and speed.
Sure, but it's just a for loop, looping through the words in the dictionary and using the Python library 'python-Levenshtein' to calculate the distance between each dictionary word and the misspelled word.
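The brute-force loop described could look something like this; a pure-Python edit distance stands in for the C-backed `python-Levenshtein` library so the sketch is self-contained:

```python
def levenshtein(a, b):
    # Pure-Python stand-in for Levenshtein.distance from
    # the python-Levenshtein library.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def closest_word(misspelled, dictionary):
    # Brute force: compare the misspelled word against every
    # dictionary word and keep the nearest one. This is the O(n)
    # scan that blows up to weeks at 10M words.
    return min(dictionary, key=lambda w: levenshtein(misspelled, w))
```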
For now, I am skipping Levenshtein in favor of a faster approximate distance, using the SymSpell algorithm. It is highly accurate and much faster; it reduced computation time from 21 days to 13 hours.
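SymSpell's core trick is to precompute delete-variants of every dictionary word, so lookup only needs deletes of the input plus a hash-table intersection instead of pairwise distances. A minimal edit-distance-1 sketch of that idea (not the full SymSpell library):

```python
from collections import defaultdict

def deletes(word):
    # All strings formed by deleting exactly one character.
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    # Map each word and each of its delete-variants back to the
    # source dictionary words (done once, offline).
    index = defaultdict(set)
    for w in dictionary:
        index[w].add(w)
        for d in deletes(w):
            index[d].add(w)
    return index

def lookup(word, index):
    # Candidates within edit distance 1: exact hits, insertions/
    # deletions via delete-variants, substitutions via shared deletes.
    candidates = set(index.get(word, ()))
    for d in deletes(word):
        candidates |= index.get(d, set())
    return candidates
```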
Thanks, I am using a different algorithm now, SymSpell. But I haven't used multithreading yet. A really good idea; it would speed things up several times over
Yeah, I looked at some LMs on Hugging Face for filling the mask. They looked good, but required significant memory and computational resources to train.
This approach is the best, but due to resource constraints I might have to fall back on simpler algorithms. Currently I am going to use SymSpell; it is surprisingly accurate and fast.
But I will keep looking for a less resource-hungry LM, since lookup time is low and they are better at catching context and grammar. Combining Levenshtein with the varying model output would increase accuracy many times further. Ultimately I will shift to that, thanks
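The "Levenshtein plus model output" combination mentioned above could be as simple as reranking the masked-LM's suggestions by string similarity to the typed word; this sketch uses the stdlib `difflib` as the similarity measure, and the candidate list is hypothetical:

```python
import difflib

def rerank(original, lm_candidates):
    # Order masked-LM suggestions by surface similarity to the typed
    # word, so corrections close in spelling beat contextually
    # plausible but distant ones. lm_candidates is a hypothetical
    # list of fill-mask outputs from a language model.
    return sorted(
        lm_candidates,
        key=lambda c: difflib.SequenceMatcher(None, original, c).ratio(),
        reverse=True,
    )
```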
Yeah, I googled BK-trees. The data structure is amazing and cuts a lot of computation time. While searching that, I found another algorithm, SymSpell. It is even faster with high accuracy, but doesn't use Levenshtein, so I am currently going to use that.
But BK-trees are the go-to method when using pure Levenshtein and similar, more accurate, string distances. So I will compare the accuracy of both algorithms to choose the better one. Thanks
Glad it was helpful!
[deleted]
goedel777 t1_iwjxp76 wrote
unique(incorrectly_spelled_words)