Submitted by Devinco001 t3_yzh6v1 in MachineLearning
[removed]
I am going to use the embeddings to cluster the text in an unsupervised manner and extract the most popular intents.
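As a minimal sketch of that pipeline (assuming the sentence-transformers and scikit-learn packages; the model name, example sentences, and cluster count are illustrative placeholders, not OP's actual setup):

```python
# Illustrative sketch: embed short texts, then cluster them to surface intents.
# Assumes sentence-transformers and scikit-learn; model and k are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "how do I reset my password",
    "I want to cancel my order",
    "where is my refund",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
embeddings = model.encode(sentences, batch_size=64)  # one vector per sentence

# k would be tuned in practice (elbow method, silhouette score, etc.)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(labels)  # cluster id per sentence; inspect clusters to name intents
```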
1, 2. I would be fine with a bit of a trade-off in accuracy. Time is the main concern, since I don't want it to take more than a day. Maybe I have to use something other than BERT
I googled them, and RoBERTa seems to be the best choice, much better than BERT-base or BERT-large
I actually asked this because Google Colab has some restrictions on free usage
Thanks, really good article
If you want to cluster sentences, take a look at LaBSE. This model was specifically designed for embedding extraction. https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html?m=1
This looks really interesting, thanks. Is it open source?
There are several pretrained implementations; a minimal example using one of them is sketched below.
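For example, LaBSE is published open source on the Hugging Face Hub and can be loaded through the sentence-transformers package (an illustrative usage, not the only option):

```python
# Load the sentence-transformers port of LaBSE and embed a sentence.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
embeddings = model.encode(["how do I reset my password"])
print(embeddings.shape)  # (1, 768): one 768-dim vector per input sentence
```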
Will surely check them out, thanks
>First the BERT model generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained word-vector table, then you run those embeddings through a transformer for some type of inference
Could you describe this a bit further in terms of inputs and outputs?
I think I get that you go from a string to a list of individual tokens, but when you say you then feed those into pre-trained word vectors, does that mean you output a list of floating-point values representing the document as a single point in high-dimensional space?
I thought that's specifically what the transformer does, so I'm not sure what other role it performs here...
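For concreteness, here is a sketch of those inputs and outputs using the Hugging Face transformers API (the model name and the mean-pooling step are illustrative choices, not anything stated in this thread):

```python
# String -> token ids -> per-token contextual vectors -> pooled sentence vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenization: the string becomes a list of integer token ids.
inputs = tokenizer("hello world", return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 4]), incl. [CLS] and [SEP]

# The transformer maps each token id to a context-dependent float vector.
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 4, 768])

# A single point in high-dimensional space for the whole string comes from
# pooling the per-token vectors, e.g. a simple mean.
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```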
What length of texts? Sentence? Paragraph? Page? Multiple pages? Books?
With roughly 2.5 million texts, a sentence might average 10 tokens, a page 750 tokens, and a book 225,000 tokens, so anywhere from 25 million to 562.5 billion tokens in total.
Yes, they are short and conversational, covering business intents. Average token length is around 10. Total is approx 2.5 million sentences
skelly0311 t1_iwzz7td wrote
For starters, why are you generating word embeddings? First the BERT model generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained word-vector table, then you run those embeddings through a transformer for some type of inference. So I'll assume you're feeding those word embeddings into an actual transformer for inference. If this is true.
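To make those two stages concrete, here is an illustrative sketch with Hugging Face BERT, separating the pre-trained embedding lookup from the transformer layers that follow it (the model id and example sentence are placeholders):

```python
# Stage 1: static, pre-trained token-embedding lookup (no context yet).
# Stage 2: the transformer encoder turns them into contextual vectors.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("the bank of the river", return_tensors="pt")["input_ids"]

static = model.embeddings.word_embeddings(ids)  # embedding table lookup only
print(static.shape)  # torch.Size([1, 7, 768])

with torch.no_grad():
    contextual = model(input_ids=ids).last_hidden_state  # full encoder stack
print(contextual.shape)  # torch.Size([1, 7, 768]), now context-dependent
```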