Submitted by minhrongcon2000 t3_z8kit4 in MachineLearning
YouAgainShmidhoobuh t1_iyc24s6 wrote
Transformers gain the most when you compare training-corpus size against log-likelihood performance. It is also in the regime of large datasets and long sequence lengths that transformers really stand out.
minhrongcon2000 OP t1_iyc2a84 wrote
So that does mean Transformers truly shine when the amount of data is huge, right? (Maybe the word "huge" is a bit of an understatement.)
entropyvsenergy t1_iyd6mw0 wrote
Transformers do well with lots of data because the transformer is an extremely flexible and generic architecture. In a fully connected network, each input is mapped to the next layer through a weight matrix that is fixed with respect to any input. Transformers instead use attention blocks, where the "effective" weight matrices are computed by the attention operation from query, key, and value vectors, and thus depend on the inputs. What this means is that you need a lot of data to train a transformer before it outperforms less flexible architectures such as LSTMs or fully connected networks.
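A minimal NumPy sketch of the point above (my own illustration, not from the thread): a fixed linear layer applies the same weight matrix to every input, while single-head self-attention produces a mixing matrix that changes with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 3  # embedding dimension, sequence length

# Fixed layer: one weight matrix, identical for every input.
W_fixed = rng.normal(size=(d, d))

# Attention parameters are also fixed, but the *effective* mixing
# weights A below are recomputed from each input.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d))  # (T, T) input-dependent mixing weights
    return A, A @ V

X1 = rng.normal(size=(T, d))
X2 = rng.normal(size=(T, d))
A1, _ = self_attention(X1)
A2, _ = self_attention(X2)

# Same parameters, different inputs -> different effective weights.
print(np.allclose(A1, A2))  # False
```

The extra flexibility is exactly why the architecture is data-hungry: the model has to learn, from examples alone, how to shape those attention patterns for every input.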
yannbouteiller t1_iydt54w wrote
Calling fully connected networks "less flexible" than transformers sounds misleading. Although transformers are very generic, as far as I can tell they have much more inductive bias than, e.g., an MLP that takes the whole sequence of word embeddings as input.