Submitted by enryu42 t3_122ppu0 in MachineLearning
cegras t1_jdsd89g wrote
I don't see how it is possible to not end up just memorizing the internet, which is full of enough questions and discussions to simulate convincing Q&As. Consider if a team had invented an algorithm or heuristic to avoid data contamination (https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks). Then what you have is something that can separate content into logically similar, but orthogonal realizations. That would be an incredibe tool and worth a prize in its own right.
pengo t1_jdt6iv2 wrote
> Then what you have is something that can separate content into logically similar, but orthogonal realizations.
Like a word vector? The thing every language model is based on?
cegras t1_jdta9mj wrote
More like the ability to recognize that 'reversing a linked list' and 'linked list cycle and traversal problems' involve the same concepts but are different problems, and to separate those into train/test. Clearly they haven't figured that out, because ChatGPT is contaminated, and their (opaquely disclosed) ways of addressing that issue don't seem adequate at all.
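The kind of decontamination described here could be sketched as clustering problems by semantic similarity and then keeping each whole cluster on one side of the train/test split, so paraphrases of the same problem never straddle the boundary. A minimal toy sketch follows: the bag-of-words cosine similarity is a crude stand-in for a real sentence embedding, and the function names and the 0.5 threshold are illustrative assumptions, not anyone's actual pipeline.

```python
from collections import Counter
from math import sqrt

def cosine_sim(a: str, b: str) -> float:
    # Toy similarity: bag-of-words cosine. A real system would use
    # sentence embeddings, which can match paraphrases with no shared words.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def split_by_concept(problems: list[str], threshold: float = 0.5):
    # Greedy clustering: any problem similar to a cluster member joins
    # that cluster; whole clusters then go to train OR test, never both.
    clusters: list[list[str]] = []
    for p in problems:
        for c in clusters:
            if any(cosine_sim(p, q) >= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    train, test = [], []
    for i, c in enumerate(clusters):
        (train if i % 2 == 0 else test).extend(c)
    return train, test
```

With this toy version, 'reverse a linked list' and 'detect a cycle in a linked list' land in the same cluster (shared vocabulary), so they end up on the same side of the split; the hard part the comment points at is doing this when the overlap is conceptual rather than lexical.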