Submitted by SuchOccasion457 t3_y2i7h1 in MachineLearning
"On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts"
Discussion of some of the wide-attention results
https://twitter.com/andrey_kurenkov/status/1579474438822985728
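For context on what "single layer wide" versus "deep" means here, a minimal sketch in PyTorch comparing a deep stack of standard transformer encoder layers with a single layer whose feed-forward block is widened. The dimensions, layer counts, and widening factor below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

d_model = 256

# Deep baseline: several layers with the usual 4*d_model feed-forward width.
deep_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=d_model, nhead=8, dim_feedforward=4 * d_model, batch_first=True
    ),
    num_layers=6,
)

# "Wide" alternative: a single layer with a much larger feed-forward width
# (factor chosen here only so the parameter counts land in the same ballpark).
wide_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=d_model, nhead=8, dim_feedforward=24 * d_model, batch_first=True
    ),
    num_layers=1,
)

x = torch.randn(2, 16, d_model)  # (batch, sequence, d_model)
print(deep_encoder(x).shape, wide_encoder(x).shape)  # both torch.Size([2, 16, 256])

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"deep: {count(deep_encoder):,} params, wide single-layer: {count(wide_encoder):,} params")
```

The quoted claim is that, averaged over the tasks and attention variants studied, the wide single-layer configuration matches or slightly beats the deep one (by about 0.3%).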
suflaj t1_is327qu wrote
Comparing the title and that quote, I had to check that I wasn't on a satire sub.