Submitted by TensorDudee t3_zloof9 in MachineLearning

Hello Everyone 👋,

I just implemented the paper "AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE", popularly known as the Vision Transformer (ViT) paper. The paper uses a pure Transformer encoder for image recognition and achieves state-of-the-art performance without any convolutional layers, provided a huge dataset and enough computational resources are available.
Below I am sharing my implementation of this paper; please have a look and give it a 🌟 if you like it. The implementation aims to provide easy-to-read code for understanding how the model works internally.
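For anyone who wants the gist before opening the repo, here is a heavily simplified Keras sketch of the idea (placeholder hyperparameters, global average pooling instead of the paper's extra [CLS] token, and not the exact code from my repository):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative hyperparameters, not the exact ViT-B/16 settings from the paper.
IMAGE_SIZE, PATCH_SIZE, DIM, DEPTH, HEADS, MLP_DIM, NUM_CLASSES = 224, 16, 256, 6, 8, 512, 1000

class AddPositionEmbedding(layers.Layer):
    """Adds a learned positional embedding to each patch token."""
    def build(self, input_shape):
        self.pos_emb = self.add_weight(
            name="pos_emb", shape=(1, input_shape[1], input_shape[2]),
            initializer="random_normal", trainable=True)

    def call(self, x):
        return x + self.pos_emb

def build_vit():
    inputs = layers.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
    # Split the image into 16x16 patches and linearly project each one;
    # a strided convolution does both steps at once.
    x = layers.Conv2D(DIM, PATCH_SIZE, strides=PATCH_SIZE)(inputs)
    num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2
    x = layers.Reshape((num_patches, DIM))(x)
    x = AddPositionEmbedding()(x)
    # Stack of standard pre-norm Transformer encoder blocks.
    for _ in range(DEPTH):
        y = layers.LayerNormalization(epsilon=1e-6)(x)
        y = layers.MultiHeadAttention(num_heads=HEADS, key_dim=DIM // HEADS)(y, y)
        x = layers.Add()([x, y])  # residual connection around attention
        y = layers.LayerNormalization(epsilon=1e-6)(x)
        y = layers.Dense(MLP_DIM, activation="gelu")(y)
        y = layers.Dense(DIM)(y)
        x = layers.Add()([x, y])  # residual connection around the MLP
    # Pool the patch tokens and classify.
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(NUM_CLASSES)(x)
    return tf.keras.Model(inputs, outputs)

model = build_vit()
model.summary()
```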

My implementation: GitHub Link

Thanks for your attention. 😀

162

Comments

Keepclamand- t1_j06k9ky wrote

Interesting. Can you share some results and learnings?

8

Valdaora t1_j06u0vk wrote

LOL Just learn pytorch

−13

Internal-Diet-514 t1_j073xp9 wrote

Stuff like that always makes me wonder. I mean, if they had to train it on several other datasets before training it on CIFAR-10, isn't it a worse architecture (for this specific problem) than one that performs well when trained from scratch on CIFAR-10? And if that model followed the same training procedure as the ViT, I wonder whether it would beat it.

5

MOSFETBJT t1_j078ma9 wrote

Thanks, dude. TensorFlow gets a lot of hate on this sub, but I think part of it is people memeing.

22

Erosis t1_j07aho6 wrote

Yet people here praise Torch when TensorFlow equivalents are often faster in production. TensorFlow still has relevance and gets a bit too much hate here (and I personally prefer PyTorch).

4

pyepyepie t1_j07bgek wrote

Just my 2 cents, ignoring the specific model details (as I don't do vision): you would assume every model behaves differently on different data. For example, try to train a large NN on 10 examples drawn from y = mx + b, and then try to do the same with a linear model. The same also applies in less clear-cut situations, i.e. larger models that require more data vs. equally large models that are more sample-efficient but introduce more bias.
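A tiny runnable version of that toy comparison (slope and intercept picked arbitrarily, Keras used just for convenience) would look something like this:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

rng = np.random.default_rng(0)
# Ten noisy samples from y = 3x + 2 (m and b chosen arbitrarily for the toy).
x_train = rng.uniform(-1.0, 1.0, size=(10, 1)).astype("float32")
y_train = (3.0 * x_train + 2.0 + 0.1 * rng.normal(size=(10, 1))).astype("float32")
# Test points outside the training range, to see which model extrapolates.
x_test = np.linspace(-2.0, 2.0, 100, dtype="float32").reshape(-1, 1)
y_test = 3.0 * x_test + 2.0

# A deliberately over-parameterized network vs. a plain linear model (y = wx + b).
big_net = tf.keras.Sequential([layers.Dense(256, activation="relu"),
                               layers.Dense(256, activation="relu"),
                               layers.Dense(1)])
linear = tf.keras.Sequential([layers.Dense(1)])

for name, model in [("big net", big_net), ("linear ", linear)]:
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_train, y_train, epochs=500, verbose=0)
    # The linear model's inductive bias matches the data, so from only ten
    # points it usually generalizes far better than the big network.
    print(name, "test MSE:", model.evaluate(x_test, y_test, verbose=0))
```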

2

nucLeaRStarcraft t1_j07bufu wrote

We're generally trying to maximize the available labeled data. If the Transformer can ingest more data and, in the end, performs better than any other non-attention-based model given the same amount of data, then it's a better architecture.

You are asking a fair question, though, but I think the body of recent work shows that the Transformer indeed generalizes better. Otherwise, we'd see similar results from non-Transformer-based architectures, since the data and compute are already there for the groups that do this kind of research.

3

murrdpirate t1_j07k4v2 wrote

I don't think "worse" is a clear description. The issue is just that it's too complex for CIFAR-10 alone. Any model can be increased in complexity until it overfits, and thus performs worse.

A model that doesn't overfit on CIFAR-10 is unlikely to benefit from pretraining on other datasets, unless somehow those other datasets are more closely aligned with CIFAR-10 Test than CIFAR-10 Train is.

8

TensorDudee OP t1_j07oard wrote

Guys, if you like it, please show some ♥️ by starring the repository.

−3

Internal-Diet-514 t1_j07pfk6 wrote

On your first paragraph, when you say "given the same amount of data": isn't it shown here that the ViT was given more data, since it was trained on several other datasets before being fine-tuned on CIFAR-10, and then compared to other models that were most likely trained on CIFAR-10 alone? My worry is that if we're going to do a proper comparison between models, they should all follow the same training procedure. You can reach SOTA performance on a dataset using techniques other than architecture alone.

2

Internal-Diet-514 t1_j07qmb0 wrote

I agree with you. It's just that nowadays, when people say they have created an architecture that outperforms some baseline, they really mean it outperforms some baseline on ImageNet or CIFAR or some other established dataset. All data is different, and I really think the focus should be on what added ability this architecture has to model relationships in the input data that a baseline doesn't, and how that helps with the specific problem. That is why the Transformer was such a great architecture for NLP problems to begin with: it demonstrated the ability to model longer-range dependencies than an LSTM-like architecture. I'm just not sure that translated well to vision, when we begin to say it's better than a pure CNN-based architecture.

5

Internal-Diet-514 t1_j07s3t2 wrote

I think that's why we have to be careful about how we add complexity. The same model with more parameters will overfit sooner because it can start to memorize the training set, but if we add complexity in its ability to model more meaningful relationships between the data and the response, then I think overfitting would still happen, yet we'd still get better validation performance. So maybe ViT for CIFAR-10 didn't add any additional capabilities that were worth it for the problem, just additional complexity.

1

murrdpirate t1_j087lji wrote

>I think overfitting would still happen, but we’d still get better validation performance.

I think by definition, overfitting means your validation performance decreases (or at least does not increase).

>So maybe VIT for cifar-10 didn’t add any additional capabilities that were worth it for the problem, just additional complexity

Depends on what you mean by "the problem." The problem could be:

  1. Get the best possible performance on CIFAR-10 Test
  2. Get the best possible performance on CIFAR-10 Test, but only train on CIFAR-10 Train

Even if it's the second one, you could likely just reduce the complexity of the ViT model and have it outperform other models, or keep it the same but use heavy regularization during training.
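As a rough sketch of what I mean (every number here is made up, just to illustrate the knobs you'd turn, and in Keras purely for convenience):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical "shrink and regularize" recipe for training a ViT-style model
# from scratch on CIFAR-10; these are illustrative values, not tuned ones.
small_vit_config = dict(image_size=32, patch_size=4, dim=128, depth=6,
                        heads=4, mlp_dim=256, dropout=0.1)

# Regularize on the data side: standard CIFAR-10 augmentations.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomZoom(0.1),
])

# And on the optimizer side: decoupled weight decay.
# (AdamW lives under tf.keras.optimizers.experimental in older TF releases.)
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.05)
```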

4

nucLeaRStarcraft t1_j08cjvc wrote

I agree with you: if we want to test the architecture, we should use the same training procedure, including pre-training.

My theory is that, given the current results of GPT-like models, which use Transformers under the hood, and given that these groups have the compute power and data to train non-attention-based recurrent models, it's quite unlikely that the architecture isn't a main contributor.

2

M4xM9450 t1_j08eql4 wrote

It started out being not as "Pythonic" as PyTorch, so people flocked to PyTorch. Many new papers and models are implemented in PyTorch, and very few see the point in converting them to TensorFlow, since many of these models just run on desktops or servers. That said, both frameworks have their ups and downs. I myself started with Keras when it first got integrated into TensorFlow and haven't really wanted to use PyTorch because it's limited in how easily it can be brought to web/mobile apps.

15