Submitted by Secure-Technology-78 t3_10mdhxb in MachineLearning
CKtalon t1_j62hmsr wrote
Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.
data-drone t1_j62n3b9 wrote
How much more training do they need?
CKtalon t1_j62n9yw wrote
About 10-12 times more than the tokens they have already seen.
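To see where that multiplier comes from, here is a rough back-of-envelope sketch, assuming the ~20-tokens-per-parameter heuristic from the Chinchilla paper and BLOOM's commonly cited figures (~176B parameters, ~366B training tokens); treat both numbers as approximate:

```python
# Rough sketch: how undertrained is a BLOOM-scale model by the Chinchilla rule?
# Assumptions (approximate, not from this thread): compute-optimal training
# needs ~20 tokens per parameter; BLOOM-176B saw roughly 366B tokens.
TOKENS_PER_PARAM = 20

params = 176e9        # BLOOM parameter count (approximate)
tokens_seen = 366e9   # BLOOM training tokens (approximate)

optimal_tokens = TOKENS_PER_PARAM * params
shortfall = optimal_tokens / tokens_seen

print(f"Chinchilla-optimal: ~{optimal_tokens / 1e12:.1f}T tokens")
print(f"Roughly {shortfall:.0f}x more than the ~{tokens_seen / 1e9:.0f}B tokens it saw")
```

That lands at roughly 3.5T tokens, i.e. about 10x what BLOOM actually saw, which is where the 10-12x figure comes from.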
NoFairYouCheated t1_j68z10h wrote
Are there any papers or blog posts discussing this undertraining?
CKtalon t1_j695owv wrote
No. There are blog posts about it performing quite badly: https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models
Then based on the Chinchilla paper, you can kind of infer that it's a result of undertraining.
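A sketch of that inference, using the parametric loss fit from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β; the constants below are the published fit as I recall them, and the BLOOM figures are approximate:

```python
# Chinchilla parametric loss fit (Hoffmann et al., 2022); constants approximate.
# Compares a BLOOM-like (params, tokens) point against a compute-matched
# smaller model trained on more tokens, holding C ~ 6*N*D fixed.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

bloom_n, bloom_d = 176e9, 366e9   # approximate BLOOM-176B parameters / tokens
compute = 6 * bloom_n * bloom_d   # standard C ~= 6*N*D approximation

alt_n = 50e9                      # hypothetical smaller model, same compute
alt_d = compute / (6 * alt_n)

print(f"BLOOM-like (176B params, 0.37T tokens): predicted loss ~{predicted_loss(bloom_n, bloom_d):.2f}")
print(f"Same compute (50B params, {alt_d / 1e12:.1f}T tokens): predicted loss ~{predicted_loss(alt_n, alt_d):.2f}")
```

Under that fit, spending the same compute on a smaller model trained on far more tokens gives a lower predicted loss, which is the sense in which BLOOM-scale models look undertrained.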
maizeq t1_j66b3l5 wrote
Chinchilla (70B) was trained on 1.4 trillion tokens, so a 140B model would presumably need at least 2.8 trillion (the optimal token count scales roughly linearly with parameter count, afaik).
I'm not sure a 2.8 trillion token dataset actually exists.
rainy_moon_bear t1_j676oo9 wrote
This is something people don't seem to understand. Pretty much all models 100B+ are undertrained.
Taenk t1_j688cev wrote
> I’m not sure a 2.8 trillion token dataset actually exists
DeepMind's MassiveText is assumed to be about 10TB; the largest publicly available dataset is The Pile, which weighs in at about 820GB.
A 2.8 trillion token dataset would need to be more than 20TB in size, which could be possible by including more of Common Crawl (weighing in at 380TiB) or non-English resources. I suspect that training LLMs on more languages, especially outside of the Indo-European family, will improve performance within the Indo-European family.
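On the size-to-token conversion, a hedged back-of-envelope sketch: the bytes-per-token ratio depends on the tokenizer and on how much of the raw crawl survives cleaning and deduplication, so the sketch just sweeps a plausible range rather than assuming a single number:

```python
# Back-of-envelope: raw text size needed for a 2.8T-token training set.
# The bytes-per-token ratio is an assumption; it varies with the tokenizer,
# the language mix, and how aggressively the corpus is filtered/deduplicated.
TARGET_TOKENS = 2.8e12  # 2.8 trillion tokens

for bytes_per_token in (4, 5, 6, 7):
    size_tb = TARGET_TOKENS * bytes_per_token / 1e12
    print(f"~{bytes_per_token} bytes/token -> ~{size_tb:.0f} TB of text")
```

The "more than 20TB" figure above sits at the upper end of that range, roughly the ratio you get if you extrapolate from MassiveText's ~10TB.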
maizeq t1_j69vuec wrote
Nice. How are you converting between dataset size and number of tokens?
Doesn't Common Crawl get deduplicated, and is that why the number of usable tokens decreases, or is it also curation? How much of that 380TiB is actually usable?
Given the ostensibly impressive performance of the bilingual GLM-130B (Chinese + English) model that came out of Tsinghua University, that might very well be the case.
lookatmetype t1_j64nstm wrote
To be fair, most of the weights in every "Foundation" model are useless.
flashdude64 t1_j65z2q4 wrote
Do you have a citation for this that I could read?