Submitted by SAbdusSamad t3_10siibd in MachineLearning

Hello everyone,

I'm interested in diving into the field of computer vision, and I recently came across the Vision Transformer (ViT). I want to understand it in depth, but I'm not sure what prerequisites I need in order to grasp the concept fully.

Do I need a strong background in Recurrent Neural Networks (RNNs) and Transformers ("Attention Is All You Need") to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)?

I would really appreciate it if someone could shed some light on this and provide some guidance.

Thank you in advance!

85

Comments

the_architect_ai t1_j71izep wrote

I suggest you just dive straight in. Part of learning is to find out what you don’t know and slowly cover your bases from there.

61

Jurph t1_j71nymu wrote

I recommend diving in, but getting out a notepad and writing down any term you don't understand. So if you get two paragraphs in and someone says "this simply replaces back-propagation, making the updated weights sufficient for the skip-layer convolution" and you realize that you don't understand back-prop or weights or skip-layer convolution ... then you probably need to stop, go learn those ideas, and then go back and try again.

For deep neural nets, back-propagation, etc., there will be a point where a full understanding requires calculus or other strong mathematical foundations. For example, you can't accurately explain why back-prop works without a basic intuition for the Chain Rule. Similarly, activation functions like ReLU and sigmoid require a strong algebraic background for their graphs to be a useful shorthand. But you can "take it on faith" that it works, treat that part of the system like a black box, and revisit it once you understand what it's doing.
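For example, here's a tiny Python sketch (the function and all values are made up) that checks the chain-rule gradient of sigmoid(ReLU(w*x)) against a finite difference:

```python
import numpy as np

# Toy chain-rule check: gradient of sigmoid(relu(w * x)) with respect to w,
# computed analytically and then numerically (all values are made up).
sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

w, x = 0.7, 2.0
z = w * x          # forward pass: z = w*x
a = relu(z)        # a = relu(z)
y = sigmoid(a)     # y = sigmoid(a)

# Chain rule: dy/dw = sigmoid'(a) * relu'(z) * dz/dw
grad = y * (1 - y) * (1.0 if z > 0 else 0.0) * x

eps = 1e-6
numeric = (sigmoid(relu((w + eps) * x)) - y) / eps
print(grad, numeric)  # the two should agree to several decimal places
```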

I would say the biggest piece of foundational knowledge is the idea of "functions", their role in mappings and transforms, and how iterative methods like Newton's Method arrive at approximate solutions over several steps. A lot of machine learning is based on the idea of expressing the problem as a composed set of mathematical expressions that can be solved iteratively. Grasping the idea of a "loss function" that can be minimized is core to the entire discipline.
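As a toy illustration (the data is made up), "express the problem as a loss function and solve it iteratively" looks like this:

```python
import numpy as np

# Minimal gradient descent: fit a single weight w so that w * x approximates y.
# The data is made up; the point is the iterate-until-the-loss-shrinks loop.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated by the "true" w = 2

w, lr = 0.0, 0.05
for step in range(100):
    pred = w * x
    loss = ((pred - y) ** 2).mean()      # the loss function being minimized
    grad = (2 * (pred - y) * x).mean()   # its derivative with respect to w
    w -= lr * grad                       # one small iterative step downhill

print(w)  # converges toward 2.0
```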

18

juanigp t1_j71p88u wrote

matrix multiplication, linear projections, dot product

−3

atharvat80 t1_j71u3oa wrote

If you want to take the top-down approach, I'd recommend that you start by learning what transformers are. Transformers were originally intended for language modelling, so if you look up an NLP lecture series like Stanford CS224n, they cover them in detail from an NLP perspective; it should be helpful regardless. Or you can check out CS231n; it has a whole lecture on attention, transformers, and ViT. Start there and look up whatever is unclear.

Lmk if you'd like me to link any other resources; I'll edit this later. Happy learning!

13

new_name_who_dis_ t1_j71w8up wrote

If I recall correctly, ViT is a purely transformer-based architecture. So you don't need to know RNNs or CNNs, just transformers.
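To give a rough sense of what that means, here's a minimal PyTorch sketch of ViT's input pipeline (sizes follow the ViT-Base defaults of 224x224 images, 16x16 patches, 768-dim embeddings; everything else is simplified):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch, dim = 16, 768                # patch size and embedding dimension
n_patches = (224 // patch) ** 2     # 14 * 14 = 196 patches

# Split the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)  # (1, 196, 768)

# Linearly project the flattened patches, prepend a learnable [CLS] token,
# and add position embeddings -- from here on it's a standard transformer encoder.
proj = nn.Linear(3 * patch * patch, dim)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed  # (1, 197, 768)
```

In this sketch there's no convolution anywhere: the image just becomes a sequence of tokens, and the rest is the standard transformer from "Attention Is All You Need".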

7

JustOneAvailableName t1_j71yj42 wrote

Understanding the "what" is extremely easy and rather useless; to understand a paper you need to understand some level of the "why". If you have time to go in depth, aim to understand the "what not" and the "why not".

So I would argue at least some basic knowledge of CNNs is required.

2

Erosis t1_j72rzdl wrote

You'll probably be fine learning transformers directly, but a better understanding of RNNs might make some of the NLP tutorials/papers containing transformers more easily comprehensible.

Attention is a very important component of transformers, but attention can be applied to RNNs, too.
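For instance, a rough numpy sketch of "attention applied to an RNN", i.e. a decoder state attending over encoder hidden states (all sizes made up):

```python
import numpy as np

# A decoder state attending over RNN encoder hidden states (seq2seq-style).
# Sizes are made up: 5 encoder steps, hidden dimension 16.
rng = np.random.default_rng(1)
enc_states = rng.standard_normal((5, 16))  # one hidden state per input token
dec_state = rng.standard_normal(16)        # current decoder hidden state

scores = enc_states @ dec_state                  # alignment score per encoder step
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over encoder steps
context = weights @ enc_states                   # weighted sum fed to the decoder
```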

3

juanigp t1_j73a6z4 wrote

That was just my two cents: self-attention is a bunch of matrix multiplications, and 12 layers of the same thing, so it makes sense to understand why QK^T. If the question had been how to understand Mask R-CNN, the answer would have been different.

Edit: 12 layers in ViT base / BERT base
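To make "a bunch of matrix multiplications" concrete, here's a toy numpy sketch of one self-attention head (sizes made up):

```python
import numpy as np

# One self-attention head as plain matrix multiplications.
# Sizes are made up: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # token embeddings: (seq_len, d)
Wq, Wk, Wv = [rng.standard_normal((8, 8)) for _ in range(3)]

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # three linear projections
scores = Q @ K.T / np.sqrt(K.shape[-1])    # QK^T: every query dotted with every key
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
out = weights @ V                          # each token becomes a weighted mix of values
```

ViT-Base and BERT-Base repeat this (multi-headed, with an MLP in between) 12 times.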

0

Jurph t1_j73ozbe wrote

Hey, I dove into "Progressive Growing of GANs" without knowing what weights were. And now here I am, four or five years later. I've trained my own classifiers based on ViTs and DNNs, written Python interfaces for them, and I'm working on tooling to make Automatic1111's GUI behave better with Stable Diffusion. We've all got to start somewhere.

3

SAbdusSamad OP t1_j757w05 wrote

I recently obtained a PDF of the book and began searching for information on ViT. Unfortunately, it appears that the book does not cover this topic. However, I plan to use the Transformer chapter to gain an understanding of ViT.

1

icanelectoo t1_j75h90j wrote

Look up some papers that discuss them, then look up the papers those papers refer to. Write out a summary as if you had to explain it to someone else who's never seen it before.

Alternatively, you could ask ChatGPT.

2

teenaxta t1_j76i085 wrote

Most ViT discussions or videos I've seen assume you have an idea of attention and transformers.

Watch this video series to get an idea of attention and transformers in general, and then you'll be good to go:

https://www.youtube.com/watch?v=mMa2PmYJlCo

2

SimonJDPrince t1_j7htrs8 wrote

Pretty much nothing to get through the first half. High school calculus and a basic grasp of probability. Should be accessible to almost everyone. Second half needs more knowledge of probability, but I'm filling out appendices with this info.

1