
UseNew5079 t1_je9wrw6 wrote

Check the LLaMA paper: https://arxiv.org/pdf/2302.13971.pdf

Specifically this graph: https://paste.pics/6f817f0aa71065e155027d313d70f18c

Performance improves (loss decreases) with both parameter count and training time. More parameters mainly give a faster and deeper initial drop in loss, but the later part of the curves looks similar across model sizes. At least that's my interpretation.
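For intuition, this matches the Chinchilla-style scaling law from Hoffmann et al. (2022), which LLaMA builds on: loss is modeled as a function of parameters N and training tokens D. Here's a minimal sketch using the approximate fitted constants reported in that paper (not LLaMA's own fit, just an illustration of the shape):

```python
# Chinchilla-style scaling law (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants below are the approximate published fits; LLaMA's
# curves follow a similar pattern but were not fit this way here.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Compare a 7B and a 65B model across growing token budgets:
for tokens in (2e11, 5e11, 1e12, 1.4e12):
    print(f"{tokens:.0e} tokens: "
          f"7B -> {loss(7e9, tokens):.3f}, "
          f"65B -> {loss(65e9, tokens):.3f}")
```

The larger model starts at a lower loss (the 1/N^alpha term), but both curves flatten at the same rate in tokens (the 1/D^beta term), which is roughly what the graph shows.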
