skelly0311 t1_iwzz7td wrote on November 19, 2022 at 5:58 PM

#580,712

For starters, why are you generating word embeddings? First the Bert model generates word embeddings by tokenizing strings into a pre trained word vector, then you run those embeddings through a transformer for some type of inference. So, I'll assume you're feeding those word embeddings into an actual transformer for inference. If this is true:

1. Depends on time requirements. Larger models will generally be more accurate, but they also take a lot more time to perform inference than smaller models.
2. See above.
3. In my experience, and according to papers, ELECTRA and RoBERTa are variants of BERT that have outperformed BERT in experiments.
4. Again, for inference, this depends on many factors, such as the maximum number of tokens per inference example.
5. https://mccormickml.com/2019/07/22/BERT-fine-tuning/

Devinco001 OP t1_ix08pyr wrote on November 19, 2022 at 7:05 PM

#581,500

Replying to skelly0311 (#580,712)

I am going to use the embeddings for clustering the text in an unsupervised manner to get the popular intents actually.

1, 2. I would be fine with a bit of a trade-off in accuracy. Time is the main concern, since I don't want it to take more than a day. Maybe I have to use something other than BERT.

3. Googled them, and RoBERTa seems to be the best choice. Much better than base BERT or larger BERT.
4. I actually asked this because Google Colab has some restrictions on free usage.
5. Thanks, really good article.
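
The clustering plan at the top of this comment (embed each text, group the vectors, then look at the biggest groups) can be sketched on toy data. This is a minimal illustration, not a recommendation: the thread never names a clustering method, so k-means is an assumption, and the 2-D points stand in for real sentence embeddings, which would have hundreds of dimensions.

```python
# Minimal k-means sketch on toy 2-D "embeddings". k-means is an
# assumed choice; the thread does not say which algorithm to use.
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]

def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return one cluster label per point.

    The first k points are used as initial centroids, which keeps the
    result deterministic (real implementations use smarter seeding).
    """
    centroids: List[Vector] = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two obvious groups: three points near (0, 0), two near (5, 5).
points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0), (0.0, 0.1)]
labels = kmeans(points, k=2)
print(labels)
print(Counter(labels).most_common())  # largest clusters first
```

At 2.5 million sentences a pure-Python loop like this would be far too slow; a library implementation such as scikit-learn's `MiniBatchKMeans` would be the more practical choice.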

LetterRip t1_ix0zyfv wrote on November 19, 2022 at 10:23 PM

#583,461

what length of texts? sentence? paragraph? page? multiple pages? books?

A sentence might average 10 tokens, a page 750 tokens, a book 225,000 tokens. So for 2.5 million texts, that's anywhere from 25 million to 562.5 billion tokens.
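
The range in this comment is just the corpus size multiplied by a tokens-per-text guess. A quick sketch of that arithmetic, assuming a corpus of 2.5 million texts (inferred from 25 million / 10); the per-unit averages are the ballpark figures from the comment, not measurements, and the names below are made up for illustration:

```python
# Rough token budget: number of texts times average tokens per text.
# NUM_TEXTS is inferred from the 25 million figure (25e6 / 10); the
# averages are the comment's rough estimates.
NUM_TEXTS = 2_500_000

AVG_TOKENS = {
    "sentence": 10,
    "page": 750,
    "book": 225_000,
}

def total_tokens(num_texts: int, avg_tokens_per_text: int) -> int:
    """Return the total number of tokens across the whole corpus."""
    return num_texts * avg_tokens_per_text

if __name__ == "__main__":
    for unit, avg in AVG_TOKENS.items():
        print(f"{unit:>8}: {total_tokens(NUM_TEXTS, avg):,} tokens")
```

The spread between the smallest and largest case is more than four orders of magnitude, which is why the text length matters so much for estimating run time.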

Devinco001 OP t1_ix2ewbe wrote on November 20, 2022 at 5:32 AM

#587,272

Replying to LetterRip (#583,461)

Yes, they are short and conversation-based, with business intents. The average length is around 10 tokens, and there are approximately 2.5 million sentences in total.

pagein t1_ix2wkue wrote on November 20, 2022 at 9:25 AM

#588,397

Replying to Devinco001 (#581,500)

If you want to cluster sentences, take a look at LaBSE. This model was specifically designed for embedding extraction. https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html?m=1

GitGudOrGetGot t1_ix3s761 wrote on November 20, 2022 at 3:15 PM

#590,456

Replying to skelly0311 (#580,712)

>First the Bert model generates word embeddings by tokenizing strings into a pre trained word vector, then you run those embeddings through a transformer for some type of inference

Could you describe this a bit further in terms of inputs and outputs?

I think I get that you go from a string to a list of individual tokens, but when you say you then feed that into a Pre Trained Word Vector, does that mean you output a list of floating point values representing the document as a single point in high dimensional space?

I thought that's specifically what the transformer does, so not sure what other role it performs here...
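
One way to picture the inputs and outputs in question is the toy sketch below. It is not BERT: the vocabulary, the 3-dimensional vectors, and the function names are all invented for illustration, and the transformer layers are skipped entirely. It only shows the shapes involved. A string becomes a list of token IDs, the embedding table turns each ID into one vector (so the result is one vector per token, not one per document), and a separate pooling step such as averaging is what collapses those into a single point. In a real model, the transformer layers sit between the lookup and the pooling and update each token vector using the other tokens in the sequence.

```python
# Toy illustration only: the vocabulary and vectors are invented, and
# the transformer layers that a real BERT model applies are omitted.
from typing import Dict, List

VOCAB: Dict[str, int] = {"[UNK]": 0, "cancel": 1, "my": 2, "order": 3}

# One row per token ID. Real models learn these rows
# (768 dimensions per row in BERT-base).
EMBEDDING_TABLE: List[List[float]] = [
    [0.0, 0.0, 0.0],   # [UNK]
    [1.0, 0.0, 2.0],   # cancel
    [0.0, 1.0, 0.0],   # my
    [2.0, 2.0, 1.0],   # order
]

def tokenize(text: str) -> List[int]:
    """String -> list of integer token IDs (whitespace split, not WordPiece)."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def embed(token_ids: List[int]) -> List[List[float]]:
    """Token IDs -> one vector per token, via table lookup."""
    return [EMBEDDING_TABLE[i] for i in token_ids]

def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Per-token vectors -> a single vector by averaging each dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

ids = tokenize("Cancel my order")
per_token = embed(ids)
sentence_vector = mean_pool(per_token)
print(ids)                              # integer IDs, one per token
print(len(per_token), "token vectors")  # still one vector per token
print(sentence_vector)                  # one vector for the whole string
```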

Devinco001 OP t1_ix710w3 wrote on November 21, 2022 at 5:50 AM

#598,069

Replying to pagein (#588,397)

This looks really interesting, thanks. Is it open source?

pagein t1_ix71gqd wrote on November 21, 2022 at 5:55 AM

#598,098

Replying to Devinco001 (#598,069)

There are several pretrained implementations:

PyTorch implementation using the HuggingFace Transformers library, under the Apache 2.0 license.
Original TensorFlow model on TensorFlow Hub, under the same Apache 2.0 license.
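
As a rough sketch of how the PyTorch route might be used for the 2.5 million sentences discussed in this thread: the code below assumes the `sentence-transformers` package is installed and that `sentence-transformers/LaBSE` is the checkpoint meant here (the comment does not name one). The helper names, chunk size, and batch size are hypothetical placeholders, not tuned values.

```python
# Sketch only: assumes the sentence-transformers package is installed
# and that "sentence-transformers/LaBSE" is the checkpoint you want.
from typing import Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_chunks(sentences: List[str],
                    model_name: str = "sentence-transformers/LaBSE",
                    chunk_size: int = 100_000,
                    batch_size: int = 256):
    """Yield one NumPy array of embeddings per chunk of sentences."""
    # Imported here so that chunked() works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    for chunk in chunked(sentences, chunk_size):
        # encode() batches internally; normalized vectors make cosine
        # similarity a plain dot product.
        yield model.encode(chunk, batch_size=batch_size,
                           normalize_embeddings=True)
```

Working chunk by chunk means each array can be saved to disk before the next one is computed, so an interrupted Colab session loses at most one chunk of work.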

Devinco001 OP t1_ix75z7w wrote on November 21, 2022 at 6:51 AM

#598,369

Replying to pagein (#598,098)

Will surely check them out, thanks

[D] BERT related questions
