EmmyNoetherRing t1_j6j7zq4 wrote on January 30, 2023 at 6:46 PM

Reply to comment by mettle in [Discussion] ChatGPT and language understanding benchmarks by mettle

I wouldn’t mind being one of those folks. But you make a good point that the old rubrics may not be capturing it.

If you want to nail down what users are observing when they compare it to human performance, practically speaking you may need to shift to diagnostics that were designed to evaluate human performance, with the added challenge of avoiding tests whose answer sheet would already be in its training data.