Submitted by TensorDudee t3_zloof9 in MachineLearning
Hello Everyone 👋,
I just implemented the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale", popularly known as the Vision Transformer (ViT) paper. The paper applies a pure Transformer encoder to image recognition: each image is split into fixed-size 16x16 patches, and the linearly projected patches are fed to the encoder as a token sequence. It achieves state-of-the-art performance without any convolutional layers, provided the model is pre-trained on a sufficiently large dataset with enough compute.
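For a quick feel of the idea, here is a minimal sketch in PyTorch (not taken from my repo; my implementation is more detailed). The hyperparameters are the ViT-Base defaults from the paper, and I lean on torch.nn.TransformerEncoder as a stand-in for a hand-written encoder stack:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and
        # applying one shared linear projection to all of them.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

class ViT(nn.Module):
    """Sketch of the ViT-Base architecture; sizes follow the paper's defaults."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        # Prepend the learnable [class] token and add position embeddings.
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        # Classify from the [class] token's final representation, as in the paper.
        return self.head(self.norm(tokens[:, 0]))

model = ViT()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```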
Below I am sharing my implementation of this paper; please have a look and give it a 🌟 if you like it. The code is written to be easy to read, so you can follow how the model works internally.
My implementation: GitHub Link
Thanks for your attention. 😀
CatalyzeX_code_bot t1_j06c8zw wrote
Found relevant code at https://github.com/google-research/vision_transformer