
nucLeaRStarcraft t1_j07bufu wrote

We're generally trying to make use of as much labeled data as possible. If the Transformer can ingest more data and, given the same amount of data, still ends up performing better than any non-attention-based model, then it's the better architecture.

That said, you're asking a fair question; I just think the body of recent work shows that the Transformer does generalize better. Otherwise we'd see similar results from non-transformer-based architectures, since the data and compute are already there for the groups doing this kind of research.

3

pyepyepie t1_j07gugl wrote

I think it's kind of important to state what our models actually do better; I really dislike this chasing-SOTA-on-some-dataset thing. Internal-Diet has a point here.

3

Internal-Diet-514 t1_j07pfk6 wrote

On your first paragraph: when you say "given the same amount of data", isn't it shown here that the ViT was given more data, since it was pre-trained on other datasets before being fine-tuned on CIFAR-10, and then compared to models that were most likely trained on CIFAR-10 alone? My worry is that if we're going to do a proper comparison between models, they should all follow the same training procedure. You can reach SOTA performance on a dataset through techniques other than the architecture alone.
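For concreteness, the kind of comparison I mean might look like the sketch below (PyTorch/torchvision assumed; the model choices and hyperparameters are placeholders, not anything reported in the paper being discussed): every architecture is trained from scratch on CIFAR-10 with the exact same loop, so no model gets extra pre-training data.

```python
# Minimal sketch of an architecture-only comparison: same data, same budget,
# same hyperparameters -- only the architecture differs. Illustrative only.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models


def train_from_scratch(model, epochs=10, lr=3e-4,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Identical training procedure applied to every candidate architecture."""
    tfm = transforms.Compose([
        transforms.Resize(224),  # torchvision's ViT variants expect 224x224 inputs
        transforms.ToTensor(),
    ])
    loader = DataLoader(
        datasets.CIFAR10("data", train=True, download=True, transform=tfm),
        batch_size=128, shuffle=True,
    )
    model = model.to(device)
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


# No pre-trained weights for either model: weights=None keeps the comparison fair.
candidates = {
    "vit": models.vit_b_16(weights=None, num_classes=10),
    "resnet": models.resnet18(weights=None, num_classes=10),
}
trained = {name: train_from_scratch(m) for name, m in candidates.items()}
```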

2

nucLeaRStarcraft t1_j08cjvc wrote

I agree with you: if we want to test the architecture, we should use the same training procedure, including pre-training.
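Roughly, the shared procedure could be a single two-stage recipe applied unchanged to every model (illustrative names and hyperparameters, not the actual setup used by any of these groups), something like:

```python
# Sketch of one shared pretrain-then-finetune procedure. If pre-training is part
# of the recipe, every architecture gets it, so the architecture is the only
# remaining variable. For brevity this assumes both stages share the same
# classification head; in practice you would swap the head between stages.
import copy
import torch
from torch import nn, optim


def pretrain_then_finetune(model, pretrain_loader, finetune_loader,
                           pretrain_epochs=5, finetune_epochs=5, lr=3e-4,
                           device="cuda" if torch.cuda.is_available() else "cpu"):
    """Apply the identical two-stage procedure to any candidate architecture."""
    model = copy.deepcopy(model).to(device)
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for loader, epochs in [(pretrain_loader, pretrain_epochs),
                           (finetune_loader, finetune_epochs)]:
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model
```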

My theory is that, given the current results of GPT-like models, which use Transformers under the hood, and given that these groups have the compute and data to train non-attention-based recurrent models, it's quite unlikely that the architecture isn't a main contributor.

2