Submitted by blacklemon67 t3_11misax in MachineLearning
Hey everyone!
A quick Fermi estimate: if a person encountered 50,000 tokens a day (an extremely high estimate, roughly a novel per day assuming 1 token = 1 word), then by the time they were 20 they would have encountered 365 million tokens.
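For anyone who wants the back-of-envelope arithmetic spelled out (nothing beyond the numbers above):

```python
# Back-of-envelope estimate of lifetime token exposure for a very heavy reader.
tokens_per_day = 50_000   # very generous: roughly a novel per day, 1 token ~= 1 word
days_per_year = 365
years = 20

lifetime_tokens = tokens_per_day * days_per_year * years
print(f"{lifetime_tokens:,} tokens by age 20")  # 365,000,000 tokens
```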
Obviously this person would be VERY well read. However, if we trained a transformer language model on the same number of tokens, then according to scaling laws it would be worse than GPT-2 (which was trained on a dataset about an order of magnitude larger).
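To make the scaling-law point concrete, here's a minimal sketch using the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta with the constants fitted by Hoffmann et al. (2022). The model size and the GPT-2 token count are rough illustrative assumptions on my part, not claims about the actual training runs:

```python
# Sketch: Chinchilla-style loss estimate L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the fits reported by Hoffmann et al. (2022); treat them as illustrative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def estimated_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

human_budget = 365e6   # the Fermi estimate above
web_budget = 10e9      # order-of-magnitude guess at a GPT-2-scale web corpus

# Same GPT-2-sized model (~1.5B params), two very different data budgets:
for tokens in (human_budget, web_budget):
    print(f"{tokens:.2e} tokens -> predicted loss {estimated_loss(1.5e9, tokens):.2f}")
```

The point is just that under these fitted curves the data term B/D^beta dominates at a human-scale token budget, so the predicted loss stays well above what a web-scale corpus gives you.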
So the question is, why do language models need so many tokens? Does anyone know of any review papers/blog posts discussing this observation?
My theory is that we haven't yet found the most efficient architecture for language, and that transformers' ability to excel at many different tasks means you have to give them a lot of data to force them to come up with the right neural circuits for the job.
TLDR: Humans need substantially fewer tokens than transformer language models. What's the current understanding for why this is?
Acrobatic-Name5948 t1_jbjdcae wrote
If anyone knew this we would have created AGI already. Probably scale issues, plus some new ideas needed on top of deep learning.