Submitted by SAbdusSamad t3_10siibd in MachineLearning

Hello everyone,

I'm interested in diving into the field of computer vision, and I recently came across the Vision Transformer (ViT). I want to understand it in depth, but I'm not sure what prerequisites I need in order to grasp the concept fully.

Do I need a strong background in Recurrent Neural Networks (RNNs) and Transformers ("Attention Is All You Need") to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)?

I would really appreciate it if someone could shed some light on this and provide some guidance.

Thank you in advance!

85

Comments

the_architect_ai t1_j71izep wrote

I suggest you just dive straight in. Part of learning is to find out what you don’t know and slowly cover your bases from there.

61

Jurph t1_j71nymu wrote

I recommend diving in, but getting out a notepad and writing down any term you don't understand. So if you get two paragraphs in and someone says "this simply replaces back-propagation, making the updated weights sufficient for the skip-layer convolution" and you realize that you don't understand back-prop or weights or skip-layer convolution ... then you probably need to stop, go learn those ideas, and then go back and try again.

For deep neural nets, back-propagation, etc., there will be a point where a full understanding requires calculus or other strong mathematical foundations. For example, you can't accurately explain why back-prop works without a basic intuition for the Chain Rule. Similarly, activation functions like ReLU and sigmoid require a strong algebraic background for their graphs to be a useful shorthand. But you can "take it on faith" that it works, treat that part of the system like a black box, and revisit it once you understand what it's doing.
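For example, here's a tiny Python sketch (the function and all values are made up) that checks the chain-rule gradient of sigmoid(ReLU(w*x)) against a finite difference:

```python
import numpy as np

# Toy chain-rule check: gradient of sigmoid(relu(w * x)) with respect to w,
# computed analytically and then numerically (all values are made up).
sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

w, x = 0.7, 2.0
z = w * x          # forward pass: z = w*x
a = relu(z)        # a = relu(z)
y = sigmoid(a)     # y = sigmoid(a)

# Chain rule: dy/dw = sigmoid'(a) * relu'(z) * dz/dw
grad = y * (1 - y) * (1.0 if z > 0 else 0.0) * x

eps = 1e-6
numeric = (sigmoid(relu((w + eps) * x)) - y) / eps
print(grad, numeric)  # the two should agree to several decimal places
```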

I would say the biggest piece of foundational knowledge is the idea of "functions", their role in mappings and transforms, and how iterative methods like Newton's Method arrive at approximate solutions over several steps. A lot of machine learning is based on the idea of expressing the problem as a composed set of mathematical expressions that can be solved iteratively. Grasping the idea of a "loss function" that can be minimized is core to the entire discipline.
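As a toy illustration (the data is made up), "express the problem as a loss function and solve it iteratively" looks like this:

```python
import numpy as np

# Minimal gradient descent: fit a single weight w so that w * x approximates y.
# The data is made up; the point is the iterate-until-the-loss-shrinks loop.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated by the "true" w = 2

w, lr = 0.0, 0.05
for step in range(100):
    pred = w * x
    loss = ((pred - y) ** 2).mean()      # the loss function being minimized
    grad = (2 * (pred - y) * x).mean()   # its derivative with respect to w
    w -= lr * grad                       # one small iterative step downhill

print(w)  # converges toward 2.0
```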

18

juanigp t1_j71p88u wrote

matrix multiplication, linear projections, dot product

−3

atharvat80 t1_j71u3oa wrote

If you want to take the top-down approach, I'd recommend that you start by learning what transformers are. Transformers were originally intended for language modelling, so if you look up an NLP lecture series like Stanford CS224n, they cover them in detail from an NLP perspective; it should be helpful regardless. Or you can check out CS231n; it has a whole lecture on attention, transformers, and ViT. Start there and look up whatever is unclear.

Lmk if you'd like me to link any other resources; I'll edit this later. Happy learning!

13

new_name_who_dis_ t1_j71w8up wrote

If I recall correctly, ViT is a purely transformer-based architecture. So you don't need to know RNNs or CNNs, just transformers.
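To give a rough sense of what that means, here's a minimal PyTorch sketch of ViT's input pipeline (sizes follow the ViT-Base defaults of 224x224 images, 16x16 patches, 768-dim embeddings; everything else is simplified):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch, dim = 16, 768                # patch size and embedding dimension
n_patches = (224 // patch) ** 2     # 14 * 14 = 196 patches

# Split the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)  # (1, 196, 768)

# Linearly project the flattened patches, prepend a learnable [CLS] token,
# and add position embeddings -- from here on it's a standard transformer encoder.
proj = nn.Linear(3 * patch * patch, dim)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed  # (1, 197, 768)
```

In this sketch there's no convolution anywhere: the image just becomes a sequence of tokens, and the rest is the standard transformer from "Attention Is All You Need".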

7

JustOneAvailableName t1_j71yj42 wrote

Understanding the "what" is extremely easy and rather useless; to understand a paper you need to understand some level of the "why". If you have time to go in depth, aim to understand the "what not" and the "why not".

So I would argue at least some basic knowledge of CNNs is required.

2

Erosis t1_j72rzdl wrote

You'll probably be fine learning transformers directly, but a better understanding of RNNs might make some of the NLP tutorials/papers containing transformers more easily comprehensible.

Attention is a very important component of transformers, but attention can be applied to RNNs, too.
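For instance, a rough numpy sketch of "attention applied to an RNN", i.e. a decoder state attending over encoder hidden states (all sizes made up):

```python
import numpy as np

# A decoder state attending over RNN encoder hidden states (seq2seq-style).
# Sizes are made up: 5 encoder steps, hidden dimension 16.
rng = np.random.default_rng(1)
enc_states = rng.standard_normal((5, 16))  # one hidden state per input token
dec_state = rng.standard_normal(16)        # current decoder hidden state

scores = enc_states @ dec_state                  # alignment score per encoder step
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over encoder steps
context = weights @ enc_states                   # weighted sum fed to the decoder
```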

3

juanigp t1_j73a6z4 wrote

That was just my two cents: self-attention is a bunch of matrix multiplications, and 12 layers of the same thing, so it makes sense to understand why QK^T. If the question had been how to understand Mask R-CNN, the answer would have been different.

Edit: 12 layers in ViT base / BERT base
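To make "a bunch of matrix multiplications" concrete, here's a toy numpy sketch of one self-attention head (sizes made up):

```python
import numpy as np

# One self-attention head as plain matrix multiplications.
# Sizes are made up: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # token embeddings: (seq_len, d)
Wq, Wk, Wv = [rng.standard_normal((8, 8)) for _ in range(3)]

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # three linear projections
scores = Q @ K.T / np.sqrt(K.shape[-1])    # QK^T: every query dotted with every key
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
out = weights @ V                          # each token becomes a weighted mix of values
```

ViT-Base and BERT-Base repeat this (multi-headed, with an MLP in between) 12 times.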

0

Jurph t1_j73ozbe wrote

Hey, I dove into "Progressive Growing of GANs" without knowing what weights were. And now here I am, four or five years later. I've trained my own classifiers based on ViTs and DNNs, written Python interfaces for them, and I'm working on tooling to make Automatic1111's GUI behave better with Stable Diffusion. We've all got to start somewhere.

3

SAbdusSamad OP t1_j757w05 wrote

I recently obtained a PDF of the book and began searching for information on ViT. Unfortunately, it appears that the book does not cover this topic. However, I plan to use the Transformer chapter to gain an understanding of ViT.

1

icanelectoo t1_j75h90j wrote

Look up some papers that discuss them, then look up the papers those papers refer to. Write out a summary as if you had to explain it to someone else who's never seen it before.

Alternatively, you could ask ChatGPT.

2

teenaxta t1_j76i085 wrote

Most ViT discussions or videos I've seen assume you have an idea of attention and transformers.

Watch this video series to get an idea of attention and transformers in general, and then you'll be good to go:

https://www.youtube.com/watch?v=mMa2PmYJlCo

2

SimonJDPrince t1_j7htrs8 wrote

Pretty much nothing to get through the first half. High school calculus and a basic grasp of probability. Should be accessible to almost everyone. Second half needs more knowledge of probability, but I'm filling out appendices with this info.

1