Blog: https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

Abstract:

>We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
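As a rough illustration of the first method in the abstract (extrapolating the historical growth rate), here is a minimal sketch of finding the year in which a fast-growing training dataset size catches up with a slowly growing data stock. Every number and name below (`exhaustion_year`, the growth rates, the token counts) is a hypothetical placeholder, not a value from the paper, and the paper's actual model is more detailed than two exponentials.

```python
import math


def exhaustion_year(start_year, dataset_size, dataset_growth, stock_size, stock_growth):
    """Year at which an exponentially growing dataset size reaches the data stock.

    Growth rates are yearly fractions (0.5 means +50% per year).
    Returns math.inf if the dataset never catches up with the stock.
    """
    if dataset_size >= stock_size:
        return float(start_year)  # already exhausted at the start
    if dataset_growth <= stock_growth:
        return math.inf  # the stock grows at least as fast as usage
    # Solve dataset_size * (1 + g_d)^t = stock_size * (1 + g_s)^t for t.
    t = math.log(stock_size / dataset_size) / math.log(
        (1 + dataset_growth) / (1 + stock_growth)
    )
    return start_year + t


# Hypothetical inputs: datasets of 1e12 tokens growing 50% per year,
# against a stock of 1e13 tokens growing 7% per year.
year = exhaustion_year(2022, 1e12, 0.50, 1e13, 0.07)
print(f"Projected exhaustion year under these assumptions: {year:.1f}")
```

The result is very sensitive to the assumed growth rates, which is presumably why the paper reports ranges rather than single years.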

Possible solutions based on the following papers:

Retrieval mechanisms (https://arxiv.org/abs/2112.04426), EfficientZero (https://arxiv.org/abs/2111.00210) and synthetic data (https://openreview.net/forum?id=NiEtU7blzN) can be seen as possible solutions that still need to be improved on.

https://preview.redd.it/5tji6jd60e0a1.jpg?width=1559&format=pjpg&auto=webp&s=d7b5e5dbe6836fc0a59a17281cb7e2ea20e56727

https://preview.redd.it/qgsmjod60e0a1.jpg?width=1544&format=pjpg&auto=webp&s=d949c561f4a006791fecaf56bd155265b4580389

https://preview.redd.it/0zwq9ld60e0a1.jpg?width=1200&format=pjpg&auto=webp&s=808d578f3ac19ca4556830c21646d90132687918

Comments

CatalyzeX_code_bot t1_iwnbh16 wrote on November 16, 2022 at 10:20 PM

#549,109

Found relevant code at https://github.com/YeWR/EfficientZero + all code implementations here

To opt out from receiving code links, DM me

ReasonablyBadass t1_iwnbmrx wrote on November 16, 2022 at 10:21 PM

#549,121

AFAIK most LLMs don't even use one epoch?

lostmsu t1_iwnoxf0 wrote on November 16, 2022 at 11:59 PM

#549,913

Have they mentioned Efficient Zero?

I think the author is severely behind the current SOTA.

Singularian2501 OP t1_iwnpy8m wrote on November 17, 2022 at 12:07 AM

#549,978

Replying to lostmsu (#549,913)

Yes, they mentioned it at the end of their blog article. But I think it was only meant as an example of how better sample efficiency could be achieved, not as a statement about SOTA.

13ass13ass t1_iwo4lan wrote on November 17, 2022 at 2:02 AM

#550,810

Replying to Singularian2501 (#549,978)

EfficientZero is for RL on Atari games, though. How does it apply to things like large language models?

[deleted] t1_iwo4us9 wrote on November 17, 2022 at 2:04 AM

#550,837

Replying to 13ass13ass (#550,810)

[deleted]

TheRealSerdra t1_iwo4w46 wrote on November 17, 2022 at 2:04 AM

#550,841

Replying to ReasonablyBadass (#549,121)

Technically, aren’t you always doing at least one epoch? You’re doing at least one pass through all your data, even if that data is less than the amount you theoretically could use.

ktpr t1_iwode1v wrote on November 17, 2022 at 3:13 AM

#551,280

What’s wrong with self supervision? It enables combinatorial expansion of dataset sizes if the task is specified well.

ReasonablyBadass t1_iwoq0ug wrote on November 17, 2022 at 5:08 AM

#552,042

Replying to TheRealSerdra (#550,841)

Not a complete one. GPT-3, I think, didn't complete its first pass-through.

londons_explorer t1_iwp5r0a wrote on November 17, 2022 at 8:23 AM

#552,672

There is a lot more data that could be used in the form of private communications (for example all iMessage chats), if only the ethical and legal side could be sorted out.

zzzthelastuser t1_iwpi7r5 wrote on November 17, 2022 at 11:26 AM

#553,255

Replying to ReasonablyBadass (#552,042)

You could argue GPT-3 was trained on a subset of the available training data, no?

Not completing the first pass-through means the remaining data could be considered as not part of the training data.

ReasonablyBadass t1_iwplk0c wrote on November 17, 2022 at 12:06 PM

#553,461

Replying to zzzthelastuser (#553,255)

Semantics. It didn't see any of its data more than once and it had more available. Not one full epoch.

zzzthelastuser t1_iwpltkw wrote on November 17, 2022 at 12:09 PM

#553,476

Replying to ReasonablyBadass (#553,461)

Sure, but in theory my little Hello World network also had more data available on the internet.
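For anyone following the epoch argument above, here is a minimal sketch of how "fraction of an epoch" can be computed when training samples from a weighted mixture of sources. The function name `epochs_per_source`, the source names and all token counts are hypothetical and chosen only for illustration; they are not GPT-3's actual figures.

```python
def epochs_per_source(tokens_trained, sources):
    """Return how many passes training makes over each data source.

    `sources` maps a name to (size_in_tokens, sampling_weight),
    where the sampling weights are expected to sum to 1.
    """
    total_weight = sum(weight for _, weight in sources.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("sampling weights must sum to 1")
    return {
        name: tokens_trained * weight / size
        for name, (size, weight) in sources.items()
    }


# Hypothetical mixture: a large web crawl and a small curated corpus.
sources = {
    "web_crawl": (400e9, 0.6),
    "curated": (20e9, 0.4),
}
tokens_trained = 300e9

epochs = epochs_per_source(tokens_trained, sources)
overall = tokens_trained / sum(size for size, _ in sources.values())

for name, value in epochs.items():
    print(f"{name}: {value:.2f} epochs")
print(f"overall: {overall:.2f} epochs")
```

Under these made-up numbers the run covers less than one epoch overall while still repeating the small curated source several times, so whether "less than one epoch" holds depends on whether you count the whole mixture or each source separately.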

Singularian2501 OP t1_iwq1iph wrote on November 17, 2022 at 2:29 PM

#554,533

Replying to lostmsu (#549,913)

https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works

A LessWrong article I found that explains how EfficientZero works.

In my opinion, the author wants to say that systems like EfficientZero use their data more efficiently, and that similar techniques could also be applied to LLMs to increase their sample efficiency.

To be honest, I hope my post gets enough attention that the author of the paper can answer our questions.

lostmsu t1_iws6anl wrote on November 17, 2022 at 11:08 PM

#559,894

Replying to 13ass13ass (#550,810)

The point is there are many models that use the same technique.

leondz t1_ix96ivb wrote on November 21, 2022 at 6:33 PM

#604,014

We already did for most languages that aren't English. Data efficiency is the only way to catch up, for them.

leondz t1_ix96sfz wrote on November 21, 2022 at 6:34 PM

#604,043

Replying to ReasonablyBadass (#549,121)

Yeah, this gives you an idea of how little of the data is actually worth going through - most of it repeats structures found elsewhere in the data, and isn't very diverse. Going through huge low-curation datasets is inefficient: the data diversity just isn't there.

bloc97 t1_ixjuivv wrote on November 23, 2022 at 11:51 PM

#640,879

This can be considered good news. If all data is exhausted, people will actually be forced to research more data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend of using bigger and bigger datasets, because small(er) networks can be successfully trained without that much data.

[R] Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning - Epochai Pablo Villalobos et al - Trend of ever-growing ML models might slow down if data efficiency is not drastically improved!
