Submitted by Devinco001 t3_105la5f in MachineLearning

Hi everyone, I have to cluster a large chunk of textual conversational business data to find relevant topics in it.

Since there is a lot of abstract info in every text (phone numbers, URLs, numbers, emails, names, etc.), I have done some basic NER using regex and spaCy NER to tag such info and make the texts more generic and canonicalized.
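For readers following along, the regex part of this canonicalization step might look roughly like the sketch below. The patterns, the placeholder names (`<URL>`, `<EMAIL>`, `<PHONE>`, `<NUM>`) and the `canonicalize` helper are illustrative assumptions, not the poster's actual code, and the patterns are deliberately loose.

```python
import re

# Hypothetical patterns. Order matters: URLs and emails are replaced first,
# so the digits inside them are never seen by the phone/number patterns.
PATTERNS = [
    ("<URL>", re.compile(r"https?://\S+|www\.\S+")),
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<PHONE>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
    ("<NUM>", re.compile(r"\d+(?:[.,]\d+)*")),
]


def canonicalize(text):
    """Replace URLs, emails, phone numbers and remaining numbers with placeholders."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(canonicalize("Mail bob@example.com or call +1 555-123-4567 about 20 units"))
```

Replacing each match with a fixed placeholder keeps the sentence structure intact for clustering while removing the instance-specific values.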

But there are some things like product names, raw materials, brand/model, company, etc. which couldn't be tagged. Also, the accuracy of regex and spaCy NER isn't high enough.

Can anyone suggest a good Python NER library that is accurate, fast enough, preferably ships pre-trained models, and can tag diverse fields?

Thanks.

4

Comments


stu1011 t1_j3bp0om wrote

If spaCy’s NER isn’t picking up what you need, you’ll probably need to look into creating your own annotations and either fine-tuning an existing model or training a custom one. It isn’t too hard using BIO/BILOU tags. Things like “raw materials” and particularly niche models and brands are unlikely to be picked up by off-the-shelf solutions.
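To make the BIO scheme concrete: each token is tagged `B-LABEL` if it begins an entity, `I-LABEL` if it continues one, and `O` otherwise. Here is a minimal sketch of turning character-offset annotations into BIO tags. The whitespace tokenizer, the helper names and the `MATERIAL` label are assumptions made for illustration; a real project would use the tokenizer of the model being trained.

```python
import re


def whitespace_tokenize(text):
    """Return (token, start, end) triples using a naive whitespace split."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(tokens, spans):
    """Convert (start, end, label) character spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span only if it lies fully inside it.
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


text = "We ordered stainless steel from Acme Corp"
start = text.index("stainless")
spans = [(start, start + len("stainless steel"), "MATERIAL")]  # hypothetical label
tokens = whitespace_tokenize(text)
for (word, _, _), tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(word, tag)
```

BILOU extends the same idea with `L-` for the last token of an entity and `U-` for single-token entities.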

4

Just_CurioussSss t1_j3c8yom wrote

One option is Stanford NER, which is a named entity recognition tool developed by Stanford University. It uses a CRF (conditional random field) model trained on a large dataset of named entities, and it's relatively fast and accurate. Stanford NER also has pre-trained models available for various languages, so you could use one of these models or train your own model on a custom dataset.
Another option is spaCy, which is a popular natural language processing (NLP) library for Python. spaCy includes a named entity recognition component that uses a convolutional neural network (CNN) to identify named entities in text. It's generally quite accurate and fast, and it has pre-trained models available for various languages. spaCy also provides tools for training custom models on your own dataset, if you have specific named entities that you'd like the model to recognize.
Finally, you might also consider the Google Cloud Natural Language API, a cloud-based service from Google that includes entity analysis. It uses pre-trained machine learning models, supports various languages, and is generally quite accurate and fast. Note that training on your own labels is not part of the base API; as far as I know, custom entity extraction has been offered through Google's separate AutoML tooling.

1

Anjum48 t1_j3lgduo wrote

Are you using the "en_core_web_trf" model in spaCy, which is based on the roberta-base transformer model?

If that model is still not accurate enough, you may need to look into using the Hugging Face transformers library and try some more recent transformer models, e.g. DeBERTa.
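As a rough sketch of how the transformer pipeline's output could feed back into the canonicalization the original post describes: the helper below replaces entity spans with their labels. `mask_spans` and `spacy_spans` are hypothetical helper names; the only spaCy features assumed are `spacy.load`, `doc.ents`, and the `start_char`, `end_char` and `label_` attributes on each entity, and the model has to be downloaded separately.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) character span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def spacy_spans(nlp, text):
    """Collect (start, end, label) triples from a loaded spaCy pipeline."""
    return [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]


# Usage, assuming spaCy and the model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")
#   masked = mask_spans(text, spacy_spans(nlp, text))
print(mask_spans("Call Acme in Paris", [(5, 9, "ORG"), (13, 18, "GPE")]))
```

Keeping the masking step independent of the tagger makes it easy to swap in a different model later and compare the results on the same texts.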

2