
visarga t1_j6n5mgc wrote

Oh, yes, gladly. This "open"AI paper says it:

> Larger models are significantly more sample efficient, such that optimally compute efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

https://arxiv.org/abs/2001.08361

You can improve outcomes from small datasets by making the model larger.
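Concretely, the paper fits test loss as a joint power law in parameter count N and training tokens D (their eq. 1.5). Here's a quick sketch of that fit, using the approximate constants reported in the paper (the function name and example sizes are just illustrative):

```python
# Rough sketch of the loss fit in Kaplan et al. (2020), eq. (1.5).
# The exponents and constants below are the approximate values reported
# in the paper; the function name is just for illustration.

def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss L(N, D) for N non-embedding params and D training tokens."""
    alpha_n, alpha_d = 0.076, 0.095      # fitted power-law exponents (approximate)
    n_c, d_c = 8.8e13, 5.4e13            # fitted constants (approximate)
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

# Same 1B-token dataset, two model sizes: the bigger model predicts lower loss.
print(kaplan_loss(1e8, 1e9))   # roughly 3.0 for a ~100M-param model
print(kaplan_loss(1e9, 1e9))   # roughly 2.9 for a ~1B-param model
```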

1

andreichiffa t1_j6n9lg6 wrote

A lot of the conclusions from that paper have been called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805

About a year after that, Anthropic came out with a paper suggesting there were scaling laws under which undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf

Finally, more recent results from DeepMind took an additional pass at the topic and seem to suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained for 4x as long would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
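For a rough sense of what that works out to, here's a back-of-the-envelope sketch, assuming the common C ≈ 6·N·D FLOPs approximation and the ~20-tokens-per-parameter rule of thumb the paper is usually summarized by (the helper name and the budget figure are just illustrative):

```python
# Back-of-the-envelope version of the Chinchilla allocation (Hoffmann et al., 2022).
# Assumes the common C ~ 6 * N * D FLOPs approximation and the ~20 tokens per
# parameter rule of thumb; both are approximations, not the paper's exact fit.

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into (n_params, n_tokens) under C = 6 * N * D."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Gopher-scale budget (~5.8e23 FLOPs) lands near Chinchilla's ~70B params / ~1.4T tokens.
n, d = compute_optimal_split(5.76e23)
print(f"{n:.2e} params, {d:.2e} tokens")   # ~6.9e10 params, ~1.4e12 tokens
```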

Basically, the original OpenAI paper contradicted a lot of prior research on overfitting and generalization, and its result seems to come down to an instance of Simpson's paradox in some of the batching they were doing.

1