"On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts"

Discussions of some wide attention results

https://twitter.com/andrey_kurenkov/status/1579474438822985728

Comments

You must log in or register to comment.

suflaj t1_is327qu wrote on October 12, 2022 at 11:01 PM

#89,628

I had to check that I'm not on a satire sub comparing the title and that quote.

farmingvillein t1_is35lpb wrote on October 12, 2022 at 11:26 PM

#89,789

Replying to suflaj (#89,628)

Well, the key claim of the paper (which OP should have instead reflected in the top-level post) is not that there is a big accuracy increase, but that performance is equal or better, while being computationally advantaged:

> We show an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware, in addition, these wider models are also more interpretable

I.e., get ~equal performance at lower cost (what's not to like?).

That said, the real issue with this paper is that they only look at very small datasets...which makes the paper basically useless for making grandiose claims like:

> WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS

That doesn't mean that the paper itself is useless, of course...it is an interesting data point...but they absolutely should not have chosen that title.

Historical_Ad2338 t1_is3636c wrote on October 12, 2022 at 11:30 PM

#89,814

Genuinely shocking, Scaling laws for Neural Language Models figure 6 found that single layers weren't supposed to scale as well (with the same parameters) though ofc the fine details of this new paper are diff.

suflaj t1_is363fo wrote on October 12, 2022 at 11:30 PM

#89,815

Replying to farmingvillein (#89,789)

I didn't mean that it is useless. I find it funny that someone would actually say that instead of "they perform roughly the same". Especially since they do not show that is a statistically significant difference, we have seen your average BERT get much more performance by just rerolling on a different seed.

farmingvillein t1_is36n5p wrote on October 12, 2022 at 11:34 PM

#89,831

Replying to suflaj (#89,815)

Sorry, didn't mean to imply that you were saying that it was useless--that is in response to my own criticism of the paper's title (versus the paper itself).

> I find it funny that someone would actually say that instead of "they perform roughly the same"

Yeah...for better or worse, though, if you say something performs "on parity", people assume (because it is frequently true...) that what you really mean is "-0.1% but that totally isn't a big deal".

I don't fault them for highlighting the 0.3% as a light pushback on the above, but I do blame 1) OP in their post highlighting this point (which, to your point, is at best misleading about the key claims of the paper) and 2) the authors for picking the ludicrous title.

lostmsu t1_is3ge2v wrote on October 13, 2022 at 12:46 AM

#90,228

Paper link: https://arxiv.org/abs/2210.00640

mrpogiface t1_is400t9 wrote on October 13, 2022 at 3:15 AM

#91,129

Replying to Historical_Ad2338 (#89,814)

Yeah, I don't think the OP paper did any scaling experiments, so I'm a bit sceptical long term, but it would be awesome for efficiency if it worked out.

Also, it turns out that the scaling laws in the paper you linked weren't quite right either (a la chinchilla) so who knows, maybe there is something that was missed when you move out of the infinite data regime