pyepyepie t1_j07bgek wrote
Reply to comment by Internal-Diet-514 in [P] Implemented Vision Transformers 🚀 from scratch using TensorFlow 2.x by TensorDudee
Just my 2 cents, ignoring the specific model details (as I don't do vision): you would expect every model to behave differently on different data. For example, try to train a large NN on 10 examples drawn from y = mx + b, and then try the same with a linear model. The same applies in less clear-cut situations, i.e., larger models that require more data vs. smaller models that are more sample-efficient but introduce more bias.
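To make that toy comparison concrete, here is a minimal sketch of the idea (my own illustration, not from the thread), assuming a made-up target y = 3x + 2 and illustrative hyperparameters: an overparameterized MLP and a one-weight linear model are both fit to 10 noisy points, then evaluated outside the training range.

```python
# Illustrative sketch only: 10 noisy samples of y = 3x + 2 (values assumed
# for the example), fit by an overparameterized MLP and by a linear model.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(10, 1)).astype("float32")
y_train = (3.0 * x_train + 2.0 + rng.normal(0, 0.1, size=(10, 1))).astype("float32")

# Test points extend past the training range, so extrapolation matters.
x_test = np.linspace(-2, 2, 100, dtype="float32").reshape(-1, 1)
y_test = 3.0 * x_test + 2.0

# Large NN: tens of thousands of weights for 10 data points.
big_nn = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
big_nn.compile(optimizer="adam", loss="mse")
big_nn.fit(x_train, y_train, epochs=500, verbose=0)

# Linear model: exactly the right inductive bias for this data.
linear = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
linear.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
linear.fit(x_train, y_train, epochs=500, verbose=0)

print("big NN test MSE:", big_nn.evaluate(x_test, y_test, verbose=0))
print("linear test MSE:", linear.evaluate(x_test, y_test, verbose=0))
```

Off the training range the linear model typically wins by a wide margin: its bias matches the data-generating process, which is the sample-efficiency point above.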
Internal-Diet-514 t1_j07qmb0 wrote
I agree with you; it's just that nowadays, when people say they have created an architecture that outperforms some baseline, they really mean it outperforms some baseline on ImageNet or CIFAR or some other established dataset. All data is different, and I really think the focus should be on what added ability an architecture has to model relationships in the input data that a baseline doesn't, and how that helps with the specific problem at hand. That's why the transformer was such a great architecture for NLP problems to begin with: it demonstrated the ability to model longer-range dependencies than an LSTM-like architecture could. I'm just not sure that translated well to vision once we started saying it's better than a pure CNN-based architecture.
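On the long-range-dependency point, here is a tiny sketch (my own, not from TensorDudee's repo; all shapes are arbitrary) of the structural difference: one self-attention layer scores every position against every other position in a single step, while one small convolution only mixes a 3-wide neighborhood, so a pure CNN needs depth to reach the same range.

```python
# Toy illustration of receptive fields; all shapes here are arbitrary choices.
import tensorflow as tf

seq = tf.random.normal([1, 16, 32])  # (batch, sequence length, channels)

attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=3, padding="same")

# Self-attention: scores of shape (batch, heads, 16, 16), i.e. every
# position attends to every other position after a single layer.
_, scores = attn(seq, seq, return_attention_scores=True)
print(scores.shape)  # (1, 4, 16, 16)

# Convolution: each output position sees only its 3 nearest inputs;
# the effective range grows only as layers are stacked.
print(conv(seq).shape)  # (1, 16, 32)
```

Whether that one-step global mixing actually helps on a given vision dataset is exactly the per-problem question raised above.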
pyepyepie t1_j08pa80 wrote
Ideas > performance, for sure :)