Submitted by TensorDudee t3_zloof9 in MachineLearning
murrdpirate t1_j07k4v2 wrote
Reply to comment by Internal-Diet-514 in [P] Implemented Vision Transformers 🚀 from scratch using TensorFlow 2.x by TensorDudee
I don't think "worse" is a clear description. The issue is just that the model is too complex for CIFAR-10 alone. Any model can be increased in complexity until it overfits and thus performs worse.
A model that doesn't overfit on CIFAR-10 is unlikely to benefit from pretraining on other datasets, unless those datasets are somehow more closely aligned with the CIFAR-10 test set than the CIFAR-10 training set is.
Internal-Diet-514 t1_j07s3t2 wrote
I think that’s why we have to be careful about how we add complexity. The same model with more parameters will overfit sooner, because it can start to memorize the training set. But if the added complexity lets the model capture more meaningful relationships in the data, ones actually tied to the response, then I think overfitting would still happen, but we’d still get better validation performance. So maybe ViT for CIFAR-10 didn’t add any additional capabilities that were worth it for the problem, just additional complexity.
murrdpirate t1_j087lji wrote
>I think overfitting would still happen, but we’d still get better validation performance.
I think by definition, overfitting means your validation performance decreases (or at least does not increase).
>So maybe ViT for CIFAR-10 didn’t add any additional capabilities that were worth it for the problem, just additional complexity
Depends on what you mean by "the problem." The problem could be:
- Get the best possible performance on CIFAR-10 Test
- Get the best possible performance on CIFAR-10 Test, but only train on CIFAR-10 Train
Even if it were the second one, you could likely just reduce the complexity of the ViT model and have it outperform other models. Or keep the complexity the same but use heavy regularization during training.