skelly0311 t1_j9scr5c wrote

First thing to note: the best way to improve generalisability and accuracy is to have data that is as accurate as possible. If your data is trash, it doesn't matter how many parameters your classifier has; it will not produce good results.

Now, in my experience with transformer neural networks, if the task is simple binary classification, or multi-label classification with fewer than 8 or so labels (maybe more), the small models (14 million parameters) perform similarly to the base models (110 million parameters). Once the objective function becomes more complicated, such as training a zero-shot learner, more parameters means achieving a much lower loss. In that case, the large models (335 million parameters) were a significant improvement over the base models (110 million parameters).

It's hard to define and quantify how complicated an objective function is. But know that more parameters doesn't always mean better results if the objective function is simple enough.

1

skelly0311 t1_j0adie9 wrote

What algorithm are you using? If it learns in an iterative fashion, e.g. via gradient descent, you can draw a different random downsample of the majority class on every epoch of feed-forward/backprop, so you don't lose any information from the class that has more data.

I currently do this with multi-label classification problems in NLP, where the classes are much more skewed than in your use case.
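A minimal sketch of that per-epoch downsampling in NumPy (the function name and data layout are made up for illustration; your actual training loop and loader will differ):

```python
import numpy as np

def epoch_downsample(X, y, seed):
    """Balanced subset for one epoch: keep every minority-class example,
    draw a fresh random subset of the larger class(es) each epoch."""
    rng = np.random.default_rng(seed)      # seed with the epoch number
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Each epoch sees all minority examples plus a different random slice of the
# majority class, so no majority data is permanently thrown away:
# for epoch in range(n_epochs):
#     X_ep, y_ep = epoch_downsample(X_train, y_train, seed=epoch)
#     train_one_epoch(model, X_ep, y_ep)
```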

3

skelly0311 t1_iwzz7td wrote

For starters, why are you generating word embeddings? BERT first generates word embeddings by tokenizing strings and looking the tokens up in a pre-trained embedding table, then runs those embeddings through transformer layers for some type of inference. So I'll assume you're feeding those word embeddings into an actual transformer for inference. If that's true:

  1. Depends on your time requirements. Larger models will generally be more accurate, but also take a lot more time to perform inference than smaller models.
  2. See above.
  3. In my experience, and according to the papers, ELECTRA and RoBERTa are variants of BERT that have outperformed it in experiments.
  4. Again, for inference this depends on many factors, such as the max number of tokens per inference example.
  5. https://mccormickml.com/2019/07/22/BERT-fine-tuning/
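As a toy illustration of the pipeline described above (made-up two-dimensional "embeddings" and weights, not real BERT or a real tokenizer; just the shape of tokenize, look up pre-trained vectors, pool, and classify):

```python
import numpy as np

# Stand-in for a pre-trained embedding table (hypothetical values).
EMBEDDINGS = {
    "great": np.array([0.9, 0.1]),
    "movie": np.array([0.2, 0.3]),
    "terrible": np.array([-0.8, 0.1]),
}
UNK = np.zeros(2)  # fallback vector for out-of-vocabulary tokens

def embed(text):
    # Trivial whitespace "tokenizer", then embedding lookup per token.
    tokens = text.lower().split()
    return np.stack([EMBEDDINGS.get(t, UNK) for t in tokens])

def classify(text, w=np.array([1.0, 0.0]), b=0.0):
    # Mean-pool the token embeddings, then a linear head with a sigmoid.
    # A real transformer replaces this pooling/linear step with attention layers.
    pooled = embed(text).mean(axis=0)
    score = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
    return bool(score > 0.5)  # True = positive class

classify("great movie")  # → True under these toy weights
```

The point is that the embeddings by themselves are just the input representation; the actual inference happens in whatever model consumes them.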
3