DeepNonseNse t1_iqvgzsk wrote
Reply to comment by ResourceResearch in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
But then again, that just leads to another question: why are deep(er) architectures better in the first place?
Desperate-Whereas50 t1_iqwzlgc wrote
I am not a transformer expert, so maybe this is a stupid question, but is this also true for transformer-based architectures? For example, BERT uses 12/24 transformer blocks. That doesn't sound as deep as, for example, a ResNet-256.
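(A rough back-of-the-envelope sketch of my own, not from the thread: counting the block number alone understates a transformer's depth, since each standard BERT-style block contains several weight layers. The per-block breakdown below is an assumption based on the usual architecture, with layer norms ignored.)

```python
# Rough sketch: how many weight layers does a BERT-style stack contain?
# Assumes a standard block: Q, K, V, output projections (4) + two
# feed-forward matrices (2) = 6 weight layers per block.

def transformer_weight_layers(num_blocks: int) -> int:
    per_block = 4 + 2  # attention projections + feed-forward matrices
    return num_blocks * per_block

for name, blocks in [("BERT-base", 12), ("BERT-large", 24)]:
    print(f"{name}: {blocks} blocks ~ {transformer_weight_layers(blocks)} weight layers")
```

So even BERT-large only reaches on the order of ~150 weight layers by this count, which is still far from a 256-layer ResNet.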
ResourceResearch t1_iro8zof wrote
Afaik it is not clear. In my personal experience, the total number of parameters matters more than the depth, i.e. a smaller number of wider layers does the same job as a larger number of narrower layers.
Consider this paper for empirical insights on large models: https://arxiv.org/pdf/2001.08361.pdf
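(A minimal sketch of my own to illustrate the point, not from the comment or the paper: networks with very different depths can be built to hold roughly the same number of parameters, so depth and width can be traded off at a fixed parameter budget. The function name and the dimensions are hypothetical.)

```python
# Parameter count of a plain MLP: input -> `depth` hidden layers of `width` -> output.
def mlp_params(depth: int, width: int, d_in: int = 512, d_out: int = 10) -> int:
    params = (d_in + 1) * width                  # input-to-hidden layer (+ bias)
    params += (depth - 1) * (width + 1) * width  # hidden-to-hidden layers (+ bias)
    params += (width + 1) * d_out                # hidden-to-output layer (+ bias)
    return params

print("narrow & deep :", mlp_params(depth=24, width=256))   # ~1.65M parameters
print("wide & shallow:", mlp_params(depth=2,  width=1048))  # ~1.65M parameters
```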