029187 OP t1_iqt44bx wrote
Reply to comment by suflaj in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
>You are completely disregarding that just because it is possible to define a mathematical approximation it doesn't mean that there is an algorithm which can consistently lead the weights to it. Certainly in terms of self-attention the landscape optimization algorithms traverse is not really well-behaved.
Yeah, I was more just trying to understand whether there's a theoretical explanation for why backprop doesn't lead the weights to it. I 100% agree with your point though. Just because something CAN approximate doesn't mean there is an optimization algorithm that will lead to that approximation. If that were the case, every universal approximator would be as good as every other, which is clearly not true.
>Theory mostly doesn't matter for deep learning because the subject is not even explainable by theory for the most part.
This is an interesting take. Can you elaborate a bit?
suflaj t1_iqt5hll wrote
>whether there's a theoretical explanation for why backprop doesn't lead the weights to it
Not really. The intuition is that self-attention is a vastly different kernel from what FC layers can handle, especially the dot product, which I assume is the main culprit.
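To make the contrast concrete, here's a minimal sketch (plain PyTorch, made-up dimensions, nothing from any paper): an FC layer mixes features with weights that are fixed after training, while self-attention recomputes its mixing weights from the input itself through a dot product.

```python
import torch
import torch.nn.functional as F

d = 16
x = torch.randn(10, d)        # 10 tokens, d features each

# FC layer: the mixing weights are fixed once training is done;
# the same matrix is applied no matter what x contains.
W = torch.randn(d, d)
fc_out = x @ W

# Self-attention: the mixing weights are recomputed from the input itself,
# via pairwise dot products between queries and keys, then a softmax.
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / d ** 0.5   # input-dependent scores
attn = F.softmax(scores, dim=-1)
sa_out = attn @ v
```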
>This is an interesting take. Can you elaborate a bit?
I'm not sure how much more I can elaborate. If you read papers, you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part is the normalization of the self-attention scores (the square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize they didn't even check different seeds, so they never noticed that the one used in the original paper gave them fairly bad results.
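(For anyone following along, my summary of why that square root is the "theoretical" part: the dot product of two d_k-dimensional vectors with roughly unit-variance entries has a standard deviation of about sqrt(d_k), so without the scaling the softmax saturates. A quick NumPy sanity check, numbers purely illustrative:)

```python
import numpy as np

d_k = 64
q = np.random.randn(10000, d_k)
k = np.random.randn(10000, d_k)

raw = (q * k).sum(axis=1)        # unscaled attention logits
scaled = raw / np.sqrt(d_k)      # what the original paper divides by

print(raw.std())     # ~ sqrt(d_k) = 8: large logits, saturated softmax
print(scaled.std())  # ~ 1: logits stay in a well-behaved range
```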
You can also look at all the different transformer architectures that can't seem to converge on anything, since the foundation for them is so weak and non-scientific, I'd dare say arbitrary. And just as you think you might find more hope in CNNs, which aren't so arbitrary, you're met with a slightly different residual block in ConvNeXt that supposedly gives results comparable to vision transformers, yet with barely any theoretical basis for it, mostly intuition.
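To be clear about what I mean by "slightly different residual block", it's roughly something like this (a from-memory PyTorch sketch of the ConvNeXt-style block, not the authors' code; details like layer scale and stochastic depth omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> pointwise expand -> GELU -> pointwise project, plus residual."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # normalizes over channels (channels-last)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv expressed as Linear in channels-last
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return residual + x
```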
029187 OP t1_iqt73b1 wrote
>Not really. The intuition is that self-attention is a vastly different kernel from what FC layers can handle, especially the dot product, which I assume is the main culprit.
Interesting! That's good to know. I wonder whether different optimizers will make it possible in the future.
>If you read papers, you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part is the normalization of the self-attention scores (the square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize they didn't even check different seeds, so they never noticed that the one used in the original paper gave them fairly bad results.
>
>You can also look at all the different transformer architectures that can't seem to converge on anything, since the foundation for them is so weak and non-scientific, I'd dare say arbitrary. And just as you think you might find more hope in CNNs, which aren't so arbitrary, you're met with a slightly different residual block in ConvNeXt that supposedly gives results comparable to vision transformers, yet with barely any theoretical basis for it, mostly intuition.
This was actually a very good elaboration, thank you. Keep in mind, to you this was probably obvious, but to other folks like me this is very insightful.
suflaj t1_iqt971l wrote
Ah, sorry. Based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I've had reading relevant DL papers. It truly feels like the only difference between SOTA and a garbage paper is that the SOTA one somehow got to work on a specific machine, specific setup, and specific training run. And this spills over into the whole of DL.
Hopefully you won't have the misfortune of trying to replicate some of the papers that either don't have a repo linked or aren't maintained by a large corporation; if you do, you'll understand better what I meant.
029187 OP t1_iqtcofx wrote
>Ah, sorry. Based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I've had reading relevant DL papers.
It's all good. I'm happy to hear your thoughts.
I've read some papers, but I'm by no means an expert. Ironically, I've actually used ML in a professional setting, but most of my work is very much "let's run some models and use the most accurate one". Generally, squeezing out an extra percent with SOTA models isn't worth it, so I don't deal with them much.
I do try to keep up to date with the latest models, but it all seems so trial-and-error, which I think is what you were getting at.
In addition, there is a lot of incorrect theory out there, which makes it even harder for amateurs or semi-pros like me. I still see YouTube videos to this day claiming DNNs are effective because they are universal approximators, which is clearly not the reason, since there are plenty of universal approximators besides DNNs that cannot be trained as effectively on problems like image recognition or NLP. Universal approximation is likely necessary but almost certainly not sufficient.
I've been reading papers like the lottery ticket hypothesis, which try to give some insight into why DNNs are a useful architecture, as well as Google's follow-up paper about rigging the lottery.
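For anyone who hasn't read it, the core procedure there, as I understand it, is iterative magnitude pruning with rewinding to the original initialization; here is a rough sketch (my paraphrase with hypothetical helper names, not the authors' code):

```python
import copy
import torch

def find_winning_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
    # Keep a copy of the original initialization so we can rewind to it.
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model, masks)                     # assumed to train the model with the mask applied
        for name, param in model.named_parameters():
            surviving = param[masks[name].bool()].abs()
            k = int(prune_fraction * surviving.numel())
            if k == 0:
                continue
            # Prune the k smallest-magnitude weights that are still alive.
            threshold = surviving.sort().values[k - 1]
            masks[name] = masks[name] * (param.abs() > threshold).float()
        model.load_state_dict(init_state)          # rewind surviving weights to their initial values

    return masks   # the "winning ticket" is the original init restricted to this mask
```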
Those papers have gotten me pretty interested in reading up on why these models work so well, but when you look into it, the results are as you've said: a lot of trial and error without much of a theoretical underpinning. Of course, I'm no expert, so I don't want to pooh-pooh the work that a lot of very smart and experienced folks are doing.