Submitted by BB4evaTB12 t3_zff5mh in MachineLearning
Continuing my analysis of errors in widely used LLM benchmarks (post on Google's GoEmotions here): I analyzed HellaSwag and found that 36% of it contains errors.
For example, here's a prompt and a set of possible completions, quoted verbatim from the dataset. Which completion do you think is most appropriate? See if you can figure it out through the haze of typos and generally nonsensical writing.
Men are standing in a large green field playing lacrosse. People is around the field watching the game. men
- are holding tshirts watching int lacrosse playing.
- are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers.
- are running side to side of the ield playing lacrosse trying to score.
- are in a field running around playing lacrosse.
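If you want to browse examples like this yourself, here's a minimal sketch using the copy of HellaSwag on the Hugging Face Hub. This is illustrative review code, not the analysis pipeline behind the blog post; it assumes the standard `hellaswag` fields (`ctx`, `endings`, `label`).

```python
# Minimal sketch for eyeballing HellaSwag examples (not the author's
# actual analysis code). Assumes the Hugging Face Hub copy of the
# dataset with fields: ctx (context), endings (4 candidates), label.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")

# Print a few examples for manual review: the context followed by the
# four candidate endings, with the gold ending marked by "*".
for row in ds.select(range(3)):
    print(row["ctx"])
    for i, ending in enumerate(row["endings"]):
        marker = "*" if str(i) == str(row["label"]) else "-"
        print(f"  {marker} {ending}")
    print()
```

Skimming even a handful of validation examples this way surfaces the kinds of typos and incoherent completions discussed in the post.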
I'll keep it spoiler-free here, but the full blog post goes into detail on this example (and others) and explains why they are so problematic.
Link: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors
leondz t1_izckeuh wrote
This happens all the time and it's awful. Please put this up on arXiv.