029187 OP t1_iqud0ju wrote
Reply to comment by RobKnight_ in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
true but the attention layers immediately overcome locality.
029187 OP t1_iqth0xk wrote
Reply to comment by pia322 in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
I'm kinda scared by the idea that we get all the way to strong AI and still don't understand why it works.
029187 OP t1_iqtcofx wrote
Reply to comment by suflaj in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
>Ah sorry, based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I have had while reading relevant DL papers.
It's all good. I'm happy to hear your thoughts.
I've read some papers, but I'm by no means an expert. Ironically, I've actually used ML in a professional setting, but most of my work is very much "let's run some models and use the most accurate one." Squeezing out an extra percent via SOTA models is generally not worth it, so I don't deal with them much.
I do try to keep up to date with the latest models, but it all seems so trial-and-error, which I think is what you were getting at.
In addition, there is a lot of incorrect theory out there, which makes it even harder for amateurs or semi-pros like me. I still see YouTube videos to this day claiming DNNs are effective because they are universal approximators, which is clearly not the reason: there are plenty of universal approximators besides DNNs that cannot be trained nearly as effectively on problems like image recognition or NLP. Universal approximation is likely necessary but almost certainly not sufficient.
I've been reading papers like the lottery ticket hypothesis, which seem to be trying to give some insight into why DNNs are a useful architecture, as well as Google's follow-up paper about rigging the lottery.
Those papers have gotten me pretty interested in reading up on why these models work so well, but when you look into it, the results are as you've said: a lot of trial and error without much of a theoretical underpinning. Of course, I'm no expert, so I don't want to pooh-pooh the work that a lot of very smart and experienced folks are doing.
029187 OP t1_iqt73b1 wrote
Reply to comment by suflaj in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
>Not really. The intuition is that self-attention is a vastly different kernel than FC layers can handle. Especially with the whole dot product which I assume is the main culprit for it.
Interesting! That's good to know. I wonder if, with different optimizers, it will be possible in the future.
>If you read papers you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part of it is the normalization in self-attention scores (square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize that they didn't even check different seeds to realize the one in the original paper gave them fairly bad results.
>
>You can also check all the different transformer architectures that can't seem to converge into anything since the foundation for them is so bad and non-scientific, I'd dare say arbitrary. And then just as you think maybe you can get more hope with CNNs which aren't so arbitrary, you're met with a slightly different residual block in convnext that supposedly gives you results comparable to vision transformers, yet there is barely any theoretical basis over it, mostly intuition.
This was actually a very good elaboration, thank you. Keep in mind, to you this was probably obvious, but to other folks like me this is very insightful.
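For anyone else reading this later, here's a minimal NumPy sketch of my own of the scaled dot-product self-attention that the sqrt(d_k) normalization refers to. It's just a toy illustration of the idea from the original transformer paper, not anyone's actual implementation, so the shapes and the reuse of the same array for Q, K, and V are my own simplifications.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head self-attention: pairwise dot products of queries and
    keys, scaled by sqrt(d_k), softmaxed, then used to mix the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (seq_len, d_k)

# Toy input: 4 tokens of dimension 8. In a real transformer Q, K, V come
# from learned linear projections of the token embeddings; here I just
# reuse the raw embeddings for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)       # (4, 8)
```

The point being: the scores are explicit pairwise dot products between positions, rescaled by sqrt(d_k) before the softmax, and that rescaling is about the only piece of the design with a stated theoretical justification.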
029187 OP t1_iqt44bx wrote
Reply to comment by suflaj in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
>You are completely disregarding that just because it is possible to define a mathematical approximation it doesn't mean that there is an algorithm which can consistently lead the weights to it. Certainly in terms of self-attention the landscape optimization algorithms traverse is not really well-behaved.
Yeah, I was more just trying to understand whether there is a theoretical account of why backprop doesn't lead the weights there. I 100% agree with your point, though. Just because something CAN approximate a function doesn't mean there is an optimization algorithm that will actually reach that approximation. If there were, every universal approximator would be as good as every other, which is clearly not the case.
>Theory mostly doesn't matter for deep learning because the subject is not even explainable by theory for the most part.
This is an interesting take. Can you elaborate a bit?
029187 OP t1_iqsz9t7 wrote
Reply to comment by hellrail in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
Yeah, I get why the non-locality is useful: CNNs group data locally, which doesn't make sense for graph-like data (the relevant word could be very far away in the sentence).
But a densely connected deep neural network should already have what it needs to map out any arbitrary function relating nodes on a graph.
029187 OP t1_iqsyxb4 wrote
Reply to comment by suflaj in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
The relationships between inputs are just mathematical functions, and in principle the DNN could also arrive at those functions. For example, in a dense network with more than one hidden layer, the first hidden layer only looks at the inputs and their weights, but the subsequent layers look at combinations of the inputs and their weights, which could in principle be used to identify relationships. The more layers and nodes, the more complex the relationships that can be captured. As DNNs are universal approximators, this must be true. (Although, clearly, just because something can be approximated in theory doesn't mean the DNN will actually converge to it via backprop.)
Clearly, though, in a lot of use cases the attention network converges faster and to better accuracy.
Has there been a lot of research on what in particular allows the attention layers to achieve this?
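To make the comparison I'm gesturing at concrete, here's a toy sketch of my own (nothing from a paper): a dense layer over the flattened sequence sees every position at once, so in principle it could learn any interaction between tokens, whereas an attention layer builds the pairwise interactions explicitly via dot products. All the names and sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 4, 8
tokens = rng.normal(size=(seq_len, d))

# Dense/MLP view: flatten the sequence so one weight matrix sees every
# position at once; any token-token interaction has to be learned
# implicitly inside W1 and W2.
W1 = rng.normal(size=(seq_len * d, 32))
W2 = rng.normal(size=(32, d))
mlp_out = np.maximum(tokens.reshape(-1) @ W1, 0.0) @ W2    # (d,)

# Attention view: token-token interactions are computed explicitly as
# dot products, regardless of how far apart the positions are.
scores = tokens @ tokens.T / np.sqrt(d)                    # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ tokens                                # (seq_len, d)

print(mlp_out.shape, attn_out.shape)
```

Both paths can represent relationships between positions; the question I'm asking is why optimization finds them so much more readily in the attention form.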
029187 OP t1_iqrinm2 wrote
Reply to comment by dasayan05 in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
If it's only as good, then it has no benefit. But if it ends up being better, then it is useful for situations where we have enough memory.
https://arxiv.org/abs/2103.17182
This paper claims they may have found interesting ways to make it better.
029187 OP t1_iqrihvc wrote
Reply to comment by Red-Portal in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
thanks!!
029187 OP t1_iqpigzv wrote
Reply to comment by Red-Portal in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
Ah cool! Do you have any links to papers on the topic? I'd love to read them!
029187 OP t1_iqp6s79 wrote
Reply to comment by dasayan05 in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
what if, as another poster said, we did full batch but also injected noise into it?
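Roughly what I have in mind, as a sketch of my own (not from any paper): take the gradient over the whole dataset, then add a little Gaussian noise to each update to mimic the stochasticity you'd otherwise get from mini-batching. The function and parameter names here are hypothetical.

```python
import numpy as np

def noisy_full_batch_gd(grad_fn, w, lr=0.1, noise_std=0.01, steps=100, seed=0):
    """Full-batch gradient descent with isotropic Gaussian noise added to
    each update, to imitate the gradient noise of mini-batching."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        g = grad_fn(w)                                   # gradient over the whole dataset
        w = w - lr * (g + noise_std * rng.normal(size=w.shape))
    return w

# Toy loss 0.5 * ||w - target||^2, whose full-batch gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
print(noisy_full_batch_gd(lambda w: w - target, np.zeros(3)))  # ends up near target, with jitter
```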
029187 OP t1_izveeec wrote
Reply to comment by patient_zer00 in [D] - Has Open AI said what ChatGPT's architecture is? What technique is it using to "remember" previous prompts? by 029187
That is surprisingly clever.