Submitted by minhrongcon2000 t3_z8kit4 in MachineLearning
[removed]
So that means Transformers truly shine when the amount of data is huge, right? (Maybe "huge" is even an understatement here.)
I don't think that's true. It would imply that Bi-LSTMs reach good performance faster than Transformers, and Transformers catch up later during training.
I've never seen proof for that, nor do my personal experiences confirm this.
Transformers do generally need to be pre-trained on a large corpus to do well on further downstream tasks.
It depends on the accuracy you want. I can train a transformer in 30 minutes with 30k sentences on an RTX 2070 Super and get meaningful embeddings (similar words end up close to each other). It works, but, as with any model, it won't be SOTA unless you use billions of sentences, a much larger model, and many more GPUs. A rough sketch of that kind of small-scale setup is below.
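This is a minimal sketch, not the commenter's actual setup: a small transformer encoder trained with a masked-token objective on a modest corpus, then checked by looking at nearest neighbours in the token embedding table. The vocabulary size, corpus, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, D_MODEL, MAX_LEN = 20_000, 256, 64   # illustrative values
MASK_ID, PAD_ID = 1, 0

class SmallEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD_ID)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                           dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.tok(ids) + self.pos(pos))
        return self.lm_head(h)

model = SmallEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_step(batch_ids):
    # Mask 15% of non-padding tokens and predict them (MLM-style objective).
    masked = batch_ids.clone()
    mask = (torch.rand_like(batch_ids, dtype=torch.float) < 0.15) & (batch_ids != PAD_ID)
    masked[mask] = MASK_ID
    logits = model(masked)
    loss = F.cross_entropy(logits[mask], batch_ids[mask])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def nearest(token_id, k=5):
    # "Meaningful embeddings" check: similar tokens should be cosine-close.
    emb = model.tok.weight
    sims = F.cosine_similarity(emb[token_id].unsqueeze(0), emb, dim=-1)
    return sims.topk(k + 1).indices[1:]  # drop the token itself
```

On a corpus of roughly 30k tokenized sentences, a loop over `train_step` for a few epochs fits comfortably in the 30-minute, single-GPU budget the comment describes.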
I was told the same thing, and I wouldn't agree. You need a huge pretraining process if you want SOTA results; if you can compromise, you don't need as much data, but an LSTM might perform better with little data.
Transformers do well with lots of data because the transformer is an extremely flexible and generic architecture. In a fully connected network, each input is mapped to the next layer through a weight matrix that is fixed with respect to the input. In a transformer, the attention blocks compute the "effective" weight matrices from the input itself, via the attention operation over query, key, and value vectors, so the mixing of tokens depends on the inputs. The upshot is that you need a lot of data to train a transformer to better performance than less flexible architectures such as LSTMs or fully connected networks.
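Here is a small illustrative sketch of that point (names and shapes are my own, just for demonstration): in a linear layer the mixing weights are fixed parameters, while in self-attention the mixing weights are computed from the input via queries and keys, so they change when the input changes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 5, 8
x = torch.randn(seq_len, d)           # a sequence of 5 token vectors

# Fully connected / linear layer: W_fixed is the same for every input.
W_fixed = torch.randn(d, d)
out_linear = x @ W_fixed

# Self-attention: the "effective" mixing matrix A is computed from x.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
A = F.softmax(q @ k.T / d**0.5, dim=-1)   # (seq_len, seq_len), input-dependent
out_attn = A @ v

# Feed a different input and the attention weights change, unlike W_fixed.
x2 = torch.randn(seq_len, d)
A2 = F.softmax((x2 @ Wq) @ (x2 @ Wk).T / d**0.5, dim=-1)
print(torch.allclose(A, A2))  # False: the mixing pattern tracks the input
```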
Considering fully connected networks as "less flexible" than transformers sounds misleading. Although very generic, as far as I can see, transformers have much more inductive bias than, e.g., an MLP that would take the whole sequence of word embeddings as input.
Transformers gain the most when you compare training corpus size against log-likelihood performance. It is also in the regime of large datasets and long sequence lengths that transformers really stand out.