CKtalon t1_jbaogg3 wrote on March 7, 2023 at 6:06 PM

#2,179,813

Chinchilla just says that for a given compute, what is the optimal amount of data to train on to give the best bang for your buck. It doesn’t mean that the model converges to ‘best performance’ once it reaches the Chinchilla-optimal token count. Ergo, you can keep training if you have plenty of budget

Maximum OP t1_jbb5bzm wrote on March 7, 2023 at 7:53 PM

#2,180,473

Replying to CKtalon (#2,179,813)

Right, I just noticed that LLaMA says they didn't fix their compute. Thanks. I wonder if there is a small architecture that is trained until convergence.

_Arsenie_Boca_ t1_jbbh5ng wrote on March 7, 2023 at 9:07 PM

#2,181,008

Replying to __Maximum__ (#2,180,473)

Until convergence is something that we often say and hear but makes no sense by definition. Convergence never ends

Maximum OP t1_jbbi89l wrote on March 7, 2023 at 9:14 PM

#2,181,054

Replying to _Arsenie_Boca_ (#2,181,008)

Until looking at loss does not get you excited?

currentscurrents t1_jbbmmqs wrote on March 7, 2023 at 9:41 PM

#2,181,263

Replying to _Arsenie_Boca_ (#2,181,008)

Eventually you can reach a point where any possible change to the model decreases performance. Then you've fully converged.

Nobody ever does this though because of diminishing returns.

adt t1_jbbzba8 wrote on March 7, 2023 at 11:06 PM

#2,181,821

Replying to __Maximum__ (#2,180,473)

There are a few that 'feel' that way. Try Megatron-11B (~200:1) based on RoBERTa (6,198:1). Wayyyyy ahead of its time, and I've matched it with much larger models in some testing.

https://app.inferkit.com/demo

Here's the full table of Chinchilla-align comparisons:

https://lifearchitect.ai/models-table/

whata_wonderful_day t1_jbcxdwf wrote on March 8, 2023 at 3:23 AM

#2,183,341

Replying to adt (#2,181,821)

Nice! How did you get access to Megatron-11B? I can't find it online anywhere

Taenk t1_jbdidpy wrote on March 8, 2023 at 6:51 AM

#2,184,121

Replying to CKtalon (#2,179,813)

Can you rephrase that a little bit? Does it mean that Chinchilla answers „assuming that you have one Teraflop of compute time, use 20 tokens of data per parameter of model, then you hit diminishing returns in the sense that you could train another model from scratch faster“ and LLaMA answers „assuming you want optimal performance at inference time, regardless of compute budget, even small models can benefit from larger datasets“?

CKtalon t1_jbdjaxa wrote on March 8, 2023 at 7:02 AM

#2,184,138

Replying to Taenk (#2,184,121)

Instead of choosing a huge model and having it undertrained due to limited compute budget, choose the small but biggest model for your compute budget using their estimates. It doesn’t necessarily mean that a small model trained with larger datasets will naturally beat a bigger model.

Maximum OP t1_jbdqy5c wrote on March 8, 2023 at 8:47 AM

#2,184,329

Replying to adt (#2,181,821)

Thanks for the links. Looks like RoBERTa did not gain a lot from the additional trainings, only minor improvements, but yeah, it was a tiny model. How was this not a good lesson? Why did people need Chinchilla? Maybe it's just having a lot of data comes easy so people gather as much as possible, even though they know they will go maximum 1 epoch over it.

Maximum OP t1_jbdr6zj wrote on March 8, 2023 at 8:50 AM

#2,184,333

Replying to Taenk (#2,184,121)

Not quite. Assuming you have certain compute, if you have a model with 1B parameters, then use a dataset of 20B tokens. Look at the figures in Chinchilla paper, they demonstrate it nicely.

Jepacor t1_jbdrovb wrote on March 8, 2023 at 8:57 AM

#2,184,341

Replying to whata_wonderful_day (#2,183,341)

The link to the model is in the Google sheets they linked : https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md

blarg7459 t1_jbetts9 wrote on March 8, 2023 at 3:23 PM

#2,185,934

Replying to CKtalon (#2,179,813)

Doesn't that mean that if you include inference costs, and the model will be used extensively, you may actually get much better bang for your bucks by training much more than chinchilla-optimal?

cztomsik t1_jbgdoar wrote on March 8, 2023 at 9:17 PM

#2,188,326

Replying to currentscurrents (#2,181,263)

but this is likely going to take forever because of LR decay, right?

whata_wonderful_day t1_jbhp4gb wrote on March 9, 2023 at 2:53 AM

#2,190,337

Replying to Jepacor (#2,184,341)

Thanks, alas I thought it was an encoder model. I've been on the lookout for a big one, largest I've seen is deberta V2 with 1.5B params

[deleted] t1_jbio7jf wrote on March 9, 2023 at 9:10 AM

#2,191,841

[removed]

farmingvillein t1_jbk2uyw wrote on March 9, 2023 at 4:45 PM

#2,194,070

Replying to currentscurrents (#2,181,263)

> Nobody ever does this though because of diminishing returns.

Extending the LLaMa concept, I would love to see someone like Meta run the experiment where they do take their 1.4T (or w/e) tokens, and run training to convergence...on the largest model that will converge (subject to reasonable LR decay policies) in a "reasonable" time frame.

Meaning, if they trained, say, a 1M param LLM...presumably it would hit convergence (get saturated) pretty quickly. And what about 10M, 100M, etc.?

I.e., how much more can we squeeze out of a relatively-tiny model? Probably it doesn't end up super interesting from a purely generative POV, but it might look like--e.g.--Roberta+.

With a model that is so small, the cost to run this test probably(?) wouldn't be that high.

farmingvillein t1_jbk3esu wrote on March 9, 2023 at 4:49 PM

#2,194,099

Replying to blarg7459 (#2,185,934)

Yes, which was arguably the key claim of the LLaMa paper.

[D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla?

Comments

CKtalon t1_jbaogg3 wrote on March 7, 2023 at 6:06 PM

Maximum OP t1_jbb5bzm wrote on March 7, 2023 at 7:53 PM

_Arsenie_Boca_ t1_jbbh5ng wrote on March 7, 2023 at 9:07 PM

Maximum OP t1_jbbi89l wrote on March 7, 2023 at 9:14 PM

currentscurrents t1_jbbmmqs wrote on March 7, 2023 at 9:41 PM

adt t1_jbbzba8 wrote on March 7, 2023 at 11:06 PM

whata_wonderful_day t1_jbcxdwf wrote on March 8, 2023 at 3:23 AM

Taenk t1_jbdidpy wrote on March 8, 2023 at 6:51 AM

CKtalon t1_jbdjaxa wrote on March 8, 2023 at 7:02 AM

Maximum OP t1_jbdqy5c wrote on March 8, 2023 at 8:47 AM

Maximum OP t1_jbdr6zj wrote on March 8, 2023 at 8:50 AM

Jepacor t1_jbdrovb wrote on March 8, 2023 at 8:57 AM

blarg7459 t1_jbetts9 wrote on March 8, 2023 at 3:23 PM

cztomsik t1_jbgdoar wrote on March 8, 2023 at 9:17 PM

whata_wonderful_day t1_jbhp4gb wrote on March 9, 2023 at 2:53 AM

[deleted] t1_jbio7jf wrote on March 9, 2023 at 9:10 AM

farmingvillein t1_jbk2uyw wrote on March 9, 2023 at 4:45 PM

farmingvillein t1_jbk3esu wrote on March 9, 2023 at 4:49 PM