Comments


skelly0311 t1_iwzz7td wrote

For starters, why are you generating word embeddings? First the BERT model generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained word vector table, then you run those embeddings through a transformer for some type of inference. So I'll assume you're feeding those word embeddings into an actual transformer for inference. If this is true (rough sketch of that pipeline after the list):

  1. Depends on your time requirements. Larger models will generally be more accurate, but also take a lot more time to perform inference than smaller models
  2. See above
  3. In my experience, and according to the papers, ELECTRA and RoBERTa are BERT variants that have outperformed BERT in experiments
  4. Again, for inference, this depends on many factors, such as the maximum number of tokens per inference example
  5. https://mccormickml.com/2019/07/22/BERT-fine-tuning/
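
To make that concrete, here is a rough sketch of the tokenize-then-encode pipeline with the Hugging Face Transformers library; the model name, max length, and mean pooling are just illustrative choices on my part, not recommendations:

```python
# Rough sketch: string -> tokens -> embeddings -> contextual vectors -> one
# vector per sentence. Model, max_length, and mean pooling are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I want to cancel my order", "How do I reset my password?"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")

with torch.no_grad():
    out = model(**batch)  # out.last_hidden_state: (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```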

Devinco001 OP t1_ix08pyr wrote

I am actually going to use the embeddings to cluster the text in an unsupervised manner and get the popular intents (rough sketch of the clustering step after the points below).

1, 2. A bit of a trade-off in accuracy would be fine. Time is the main concern, since I don't want it to take more than a day. Maybe I'll have to use something other than BERT

  3. Googled them, and RoBERTa seems to be the best choice. Much better than BERT-base or BERT-large

  4. I actually asked this because Google Colab has some restrictions on free usage

  5. Thanks, really good article
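
For what it's worth, the clustering step I have in mind looks roughly like this, assuming the embeddings are already computed and saved; MiniBatchKMeans, the file name, and the cluster count are just placeholders:

```python
# Rough sketch of clustering sentence embeddings into intents. `embeddings`
# is assumed to be an (n_sentences, dim) array from whichever encoder I use;
# the .npy file name and n_clusters are hypothetical placeholders.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

embeddings = np.load("sentence_embeddings.npy")  # hypothetical precomputed file

kmeans = MiniBatchKMeans(n_clusters=50, batch_size=4096, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Largest clusters correspond to the most popular intents
sizes = np.bincount(labels)
for cluster_id in np.argsort(sizes)[::-1][:10]:
    print(cluster_id, sizes[cluster_id])
```

The number of clusters is the part I'll have to tune, probably by inspecting samples from the biggest clusters.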


pagein t1_ix2wkue wrote

If you want to cluster sentences, take a look at LaBSE. This model was specifically designed for embedding extraction: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html?m=1
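
The easiest route is probably the sentence-transformers package; the model id below is the LaBSE checkpoint published on the Hugging Face hub:

```python
# Quick sketch of sentence embedding extraction with LaBSE via the
# sentence-transformers package (model id as published on the HF hub).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
sentences = ["I want to cancel my order", "Where is my refund?"]
embeddings = model.encode(sentences, batch_size=64)
print(embeddings.shape)  # (2, 768)
```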


Devinco001 OP t1_ix710w3 wrote

This looks really interesting, thanks. Is it open source?


pagein t1_ix71gqd wrote

There are several pretrained implementations:

  • PyTorch implementation using the Hugging Face Transformers library, under the Apache 2.0 license
  • Original TensorFlow model on TensorFlow Hub, under the same Apache 2.0 license.
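
A rough sketch of the PyTorch/Transformers route; the model id and the use of the pooled output are my assumptions based on how that checkpoint is published, so double-check the model card:

```python
# Rough sketch of loading the Hugging Face port of LaBSE; model id and the
# pooling choice are assumptions -- verify against the model card.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")
model.eval()

sentences = ["I want to cancel my order", "Quiero cancelar mi pedido"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Sentence vectors taken from the pooled ([CLS]) output and L2-normalized
# so cosine similarity behaves well across languages.
embeddings = torch.nn.functional.normalize(out.pooler_output, dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```

As far as I know both ports come from the same checkpoint, so whichever is easier to deploy should be fine.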

GitGudOrGetGot t1_ix3s761 wrote

>First the BERT model generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained word vector table, then you run those embeddings through a transformer for some type of inference

Could you describe this a bit further in terms of inputs and outputs?

I think I get that you go from a string to a list of individual tokens, but when you say you then look those up in a pre-trained word vector table, does that mean you output a list of floating-point values representing the document as a single point in high-dimensional space?

I thought that's specifically what the transformer does, so I'm not sure what other role it performs here...
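
To make my confusion concrete, here is roughly the mental model I have so far, just a sketch using Hugging Face names (the model choice is arbitrary); is this right?

```python
# Sketch of my current mental model: ids -> static per-token vectors ->
# context-dependent per-token vectors. Model choice is arbitrary.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokens = tokenizer("hello world", return_tensors="pt")
print(tokens["input_ids"])        # integer token ids, not floats

# Lookup in the pre-trained embedding table: one static vector per token
static = model.embeddings.word_embeddings(tokens["input_ids"])
print(static.shape)               # (1, num_tokens, 768)

# The transformer layers then turn those into context-dependent vectors,
# still one vector per token rather than one per document
with torch.no_grad():
    contextual = model(**tokens).last_hidden_state
print(contextual.shape)           # (1, num_tokens, 768)
```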


LetterRip t1_ix0zyfv wrote

what length of texts? sentence? paragraph? page? multiple pages? books?

A sentence might average 10 tokens, a page 750 tokens, a book 225,000 tokens. So for 2.5 million texts that's anywhere from 25 million to 562.5 billion tokens.


Devinco001 OP t1_ix2ewbe wrote

Yes, they are short and conversational, business intents. Average token length is around 10, and there are approximately 2.5 million sentences in total.
