Submitted by TensorDudee t3_zloof9 in MachineLearning
Hello Everyone 👋,
I just implemented the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale", popularly known as the Vision Transformer (ViT) paper. The paper applies a pure Transformer encoder to image recognition: each image is split into fixed-size 16x16 patches, and the linearly projected patches are fed to the encoder as a token sequence. It achieves state-of-the-art performance without any convolutional layers, provided the model is pre-trained on a sufficiently large dataset with enough compute.
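For a quick feel of the idea, here is a minimal sketch in PyTorch (not taken from my repo; my implementation is more detailed). The hyperparameters are the ViT-Base defaults from the paper, and I lean on torch.nn.TransformerEncoder as a stand-in for a hand-written encoder stack:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and
        # applying one shared linear projection to all of them.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

class ViT(nn.Module):
    """Sketch of the ViT-Base architecture; sizes follow the paper's defaults."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        # Prepend the learnable [class] token and add position embeddings.
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        # Classify from the [class] token's final representation, as in the paper.
        return self.head(self.norm(tokens[:, 0]))

model = ViT()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```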
Below I am sharing my implementation of this paper; please have a look and give it a 🌟 if you like it. The code is written to be easy to read, so you can follow how the model works internally.
My implementation: GitHub Link
Thanks for your attention. 😀
CatalyzeX_code_bot t1_j06c8zw wrote
Found relevant code at https://github.com/google-research/vision_transformer