Submitted by Angry_Grandpa_ t3_y92cl1 in singularity
If we wanted to build an AI model based on everything ever said on YouTube, then by my calculations a model with roughly 770 billion parameters trained on about 15.7 trillion tokens would be sufficient.
This assumes all of the audio is converted to text, an average speech speed of 100 words per minute, and 500 hours of content uploaded to YouTube every minute. It's probably an overestimate of what we'd need, since I assumed 10 years at 500 hours per minute and YouTube didn't hit that upload rate until 2019.
As a rule of thumb, 1,000 tokens is approximately 750 words, so the token count probably needs to be a little higher (roughly 19.6 trillion).
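A quick back-of-envelope sketch of that arithmetic in Python (the upload rate, speech speed, and ~1.25 tokens-per-word conversion are the assumptions stated above):

```python
# Back-of-envelope check of the numbers above (all inputs are the post's assumptions).
YEARS = 10
MINUTES_PER_YEAR = 365 * 24 * 60
HOURS_UPLOADED_PER_MINUTE = 500          # assumed YouTube upload rate
WORDS_PER_MINUTE_OF_SPEECH = 100         # assumed average speaking speed
TOKENS_PER_WORD = 1.25                   # rough word-to-token conversion

total_hours = YEARS * MINUTES_PER_YEAR * HOURS_UPLOADED_PER_MINUTE
total_words = total_hours * 60 * WORDS_PER_MINUTE_OF_SPEECH
total_tokens = total_words * TOKENS_PER_WORD

print(f"{total_words / 1e12:.1f} trillion words")    # ~15.8 trillion
print(f"{total_tokens / 1e12:.1f} trillion tokens")  # ~19.7 trillion
```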
This is based on the compute-optimal parameter-to-token ratio from the Chinchilla paper (roughly 20 training tokens per parameter).
Source: https://arxiv.org/abs/2203.15556
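A sketch of the sizing step, using the commonly cited ~20 tokens per parameter approximation of the Chinchilla compute-optimal ratio (the exact ratio in the paper varies with compute budget):

```python
# Chinchilla-style compute-optimal sizing: roughly 20 training tokens per parameter.
TOKENS = 15.7e12          # token budget estimated above
TOKENS_PER_PARAM = 20     # approximate compute-optimal ratio

optimal_params = TOKENS / TOKENS_PER_PARAM
print(f"~{optimal_params / 1e9:.0f} billion parameters")  # ~785 billion, in line with the ~770B figure
```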
Based on MosaicML's public cloud pricing, the current cost is about $2.5 million for a model trained on 1.4 trillion tokens. Scaling that to the token count above, the worst-case scenario would be a mere $35 million for a YouTube large language model.
Source: https://www.mosaicml.com/blog/gpt-3-quality-for-500k
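The $35 million figure comes from scaling the quoted price linearly with the number of training tokens; a sketch of that extrapolation, taking the $2.5M / 1.4T-token numbers above as given:

```python
# The post's cost extrapolation: scale the quoted price linearly with training tokens.
QUOTED_COST_USD = 2.5e6
QUOTED_TOKENS = 1.4e12
TARGET_TOKENS = 19.6e12   # worst-case token count from above

naive_cost = QUOTED_COST_USD * (TARGET_TOKENS / QUOTED_TOKENS)
print(f"${naive_cost / 1e6:.0f} million")  # $35 million
```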
Is it worth it?
Has Google already done it?
manOnPavementWaving t1_it393xz wrote
my man, you can't just scale cost with the number of tokens and not the number of parameters
way too many mostly false assumptions in these calculations
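A rough sketch of that objection, assuming the quoted $2.5M price was for a Chinchilla-style ~70B-parameter model (an assumption, since the reference model size isn't stated above) and the standard ~6·N·D estimate of transformer training FLOPs:

```python
# Training compute scales roughly as 6 * N * D FLOPs (N = parameters, D = training tokens),
# so cost grows with both, not with tokens alone.
# ASSUMPTION: the quoted $2.5M / 1.4T-token price was for a ~70B-parameter model.
REF_PARAMS, REF_TOKENS, REF_COST = 70e9, 1.4e12, 2.5e6
TARGET_PARAMS, TARGET_TOKENS = 770e9, 19.6e12

scale = (TARGET_PARAMS * TARGET_TOKENS) / (REF_PARAMS * REF_TOKENS)
print(f"compute scale factor: ~{scale:.0f}x")              # ~154x
print(f"implied cost: ~${REF_COST * scale / 1e6:.0f}M")    # ~$385M, not $35M
```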