Submitted by SuchOccasion457 t3_y2i7h1 in MachineLearning
farmingvillein t1_is35lpb wrote
Reply to comment by suflaj in [D] Wide Attention Is The Way Forward For Transformers by SuchOccasion457
Well, the key claim of the paper (which OP should have reflected in the top-level post instead) is not that there is a big accuracy increase, but that performance is equal or better while being computationally cheaper:
> We show an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware, in addition, these wider models are also more interpretable
I.e., get ~equal performance at lower cost (what's not to like?).
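To make the memory-footprint point concrete, here is a rough PyTorch sketch (mine, not the paper's exact setup; the layer and head counts are made up) contrasting a single wide layer with many heads against a deep stack of narrower layers at the same hidden size. Because the heads split the model dimension, the per-layer parameter count doesn't grow with the head count, so the wide model comes out much smaller:

```python
# Rough sketch (not the paper's exact configuration): one wide layer with
# many heads vs. a deep stack with fewer heads per layer, same hidden size.
import torch
import torch.nn as nn

d_model, d_ff, seq_len, batch = 512, 2048, 128, 8  # assumed sizes

def make_encoder(num_layers: int, num_heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=num_heads, dim_feedforward=d_ff, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

wide = make_encoder(num_layers=1, num_heads=32)  # one layer, many heads
deep = make_encoder(num_layers=8, num_heads=8)   # many layers, fewer heads per layer

x = torch.randn(batch, seq_len, d_model)
for name, model in [("wide", wide), ("deep", deep)]:
    n_params = sum(p.numel() for p in model.parameters())
    out = model(x)
    print(f"{name}: {n_params / 1e6:.2f}M params, output {tuple(out.shape)}")
```

With these made-up numbers the wide encoder ends up around an eighth of the parameters of the deep one, which is the kind of gap the memory-footprint claim is about.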
That said, the real issue with this paper is that they only look at very small datasets...which makes the paper basically useless for making grandiose claims like:
> WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS
That doesn't mean that the paper itself is useless, of course...it is an interesting data point...but they absolutely should not have chosen that title.
suflaj t1_is363fo wrote
I didn't mean that it is useless. I find it funny that someone would actually say that instead of "they perform roughly the same". Especially since they do not show that the difference is statistically significant; we have seen your average BERT gain much more than that just by rerolling on a different seed.
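To spell out what I mean by significance, this is the kind of back-of-the-envelope check I'd want before reading anything into a 0.3% gap (the accuracy numbers below are placeholders, not results from the paper): rerun each model over a handful of seeds and see whether the gap survives the seed noise.

```python
import numpy as np
from scipy import stats

# Hypothetical dev-set accuracies from rerunning each model with different seeds.
deep_runs = np.array([90.1, 90.6, 89.8, 90.4, 90.2])
wide_runs = np.array([90.5, 90.3, 90.9, 90.0, 90.6])

print(f"deep: {deep_runs.mean():.2f} +/- {deep_runs.std(ddof=1):.2f}")
print(f"wide: {wide_runs.mean():.2f} +/- {wide_runs.std(ddof=1):.2f}")

# Welch's t-test: if p is well above 0.05, the reported gap is
# indistinguishable from seed noise.
t, p = stats.ttest_ind(wide_runs, deep_runs, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```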
farmingvillein t1_is36n5p wrote
Sorry, didn't mean to imply that you were saying it was useless--that was in response to my own criticism of the paper's title (versus the paper itself).
> I find it funny that someone would actually say that instead of "they perform roughly the same"
Yeah...for better or worse, though, if you say something performs "at parity", people assume (because it is frequently true...) that what you really mean is "-0.1% but that totally isn't a big deal".
I don't fault them for highlighting the 0.3% as a light pushback on the above, but I do fault 1) OP for highlighting this point in their post (which, to your point, is at best misleading about the key claims of the paper) and 2) the authors for picking the ludicrous title.