skelly0311 t1_j9scr5c wrote

First thing to note: the best way to improve generalisability and accuracy is to have data that is as accurate as possible. If your data is trash, it doesn't matter how many parameters your classifier has; it will not produce good results.

Now, in my experience with transformer neural networks, if the task is simple binary classification, or multi-label classification with fewer than 8 or so labels (maybe more), the small models (14 million parameters) perform similarly to the base models (110 million parameters). Once the objective function becomes more complicated, such as training a zero-shot learner, more parameters means achieving a much lower loss. In that case, the large models (335 million parameters) were a significant improvement over the base models (110 million parameters).

It's hard to define and quantify how complicated an objective function is. But know that more parameters doesn't always mean better results if the objective function is simple enough.

1

skelly0311 t1_j0adie9 wrote

What algorithm are you using? If it learns in an iterative fashion, e.g. via gradient descent, you can draw a different random downsample of the majority class on every epoch of feed-forward/backprop, so you don't lose any information from the class that has more data.

I currently do this with multi-label classification problems in NLP, where the classes are much more skewed than in your use case.
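A minimal sketch of that per-epoch downsampling in NumPy (the function name and data layout are made up for illustration; your actual training loop and loader will differ):

```python
import numpy as np

def epoch_downsample(X, y, seed):
    """Balanced subset for one epoch: keep every minority-class example,
    draw a fresh random subset of the larger class(es) each epoch."""
    rng = np.random.default_rng(seed)      # seed with the epoch number
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Each epoch sees all minority examples plus a different random slice of the
# majority class, so no majority data is permanently thrown away:
# for epoch in range(n_epochs):
#     X_ep, y_ep = epoch_downsample(X_train, y_train, seed=epoch)
#     train_one_epoch(model, X_ep, y_ep)
```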

3

skelly0311 t1_iwzz7td wrote

For starters, why are you generating word embeddings? BERT first generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained embedding table, then runs those embeddings through transformer layers for some type of inference. So I'll assume you're feeding those word embeddings into an actual transformer for inference. If that's true:

  1. Depends on your time requirements. Larger models will generally be more accurate, but also take a lot more time to perform inference than smaller models.
  2. See above.
  3. In my experience, and according to the papers, ELECTRA and RoBERTa are variants of BERT that have outperformed it in experiments.
  4. Again, for inference this depends on many factors, such as the max number of tokens per inference example.
  5. https://mccormickml.com/2019/07/22/BERT-fine-tuning/
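As a toy illustration of the pipeline described above (made-up two-dimensional "embeddings" and weights, not real BERT or a real tokenizer; just the shape of tokenize, look up pre-trained vectors, pool, and classify):

```python
import numpy as np

# Stand-in for a pre-trained embedding table (hypothetical values).
EMBEDDINGS = {
    "great": np.array([0.9, 0.1]),
    "movie": np.array([0.2, 0.3]),
    "terrible": np.array([-0.8, 0.1]),
}
UNK = np.zeros(2)  # fallback vector for out-of-vocabulary tokens

def embed(text):
    # Trivial whitespace "tokenizer", then embedding lookup per token.
    tokens = text.lower().split()
    return np.stack([EMBEDDINGS.get(t, UNK) for t in tokens])

def classify(text, w=np.array([1.0, 0.0]), b=0.0):
    # Mean-pool the token embeddings, then a linear head with a sigmoid.
    # A real transformer replaces this pooling/linear step with attention layers.
    pooled = embed(text).mean(axis=0)
    score = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
    return bool(score > 0.5)  # True = positive class

classify("great movie")  # → True under these toy weights
```

The point is that the embeddings by themselves are just the input representation; the actual inference happens in whatever model consumes them.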
3