Submitted by SuchOccasion457 t3_y2i7h1 in MachineLearning
suflaj t1_is363fo wrote
Reply to comment by farmingvillein in [D] Wide Attention Is The Way Forward For Transformers by SuchOccasion457
I didn't mean that it is useless. I find it funny that someone would actually say that instead of "they perform roughly the same", especially since they do not show that the difference is statistically significant; we have seen your average BERT gain much more performance just by rerolling on a different seed.
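To make the seed-variance point concrete, here is a minimal sketch (not from the paper; the per-seed scores are hypothetical placeholders) of how one might check whether a small accuracy gap between two model variants actually survives a significance test across seeds:

```python
# Minimal sketch: testing whether a small accuracy gap is statistically
# significant across random seeds. The scores below are hypothetical
# placeholders, not results from the paper.
from scipy import stats

wide_acc = [0.912, 0.908, 0.915, 0.910, 0.907]  # hypothetical per-seed scores, wide model
deep_acc = [0.909, 0.911, 0.906, 0.908, 0.912]  # hypothetical per-seed scores, deep model

t_stat, p_value = stats.ttest_ind(wide_acc, deep_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# If p is large (e.g. > 0.05), a ~0.3% gap is well within seed-to-seed noise.
```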
farmingvillein t1_is36n5p wrote
Sorry, didn't mean to imply that you were saying that it was useless--that was in response to my own criticism of the paper's title (versus the paper itself).
> I find it funny that someone would actually say that instead of "they perform roughly the same"
Yeah...for better or worse, though, if you say something performs "at parity", people assume (because it is frequently true...) that what you really mean is "-0.1% but that totally isn't a big deal".
I don't fault them for highlighting the 0.3% as a light pushback on the above, but I do blame 1) OP for highlighting this point in their post (which, to your point, is at best misleading about the key claims of the paper) and 2) the authors for picking the ludicrous title.