Submitted by Angry_Grandpa_ t3_y92cl1 in singularity
If we wanted to build an AI model based on everything ever said on YouTube, then by my calculations a model with roughly 770 billion parameters trained on about 15.7 trillion tokens would be sufficient.
This assumes all of the audio is converted to text, an average speech speed of 100 words per minute, and 500 hours of content uploaded to YouTube every minute. It's probably an overestimate of what we'd need, since I assumed 10 years at 500 hours per minute and YouTube didn't hit that upload rate until 2019.
As a rule of thumb, 1,000 tokens is approximately 750 words, so the token count probably needs to be a little higher (roughly 19.6 trillion).
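A quick back-of-envelope sketch of that arithmetic in Python (the upload rate, speech speed, and ~1.25 tokens-per-word conversion are the assumptions stated above):

```python
# Back-of-envelope check of the numbers above (all inputs are the post's assumptions).
YEARS = 10
MINUTES_PER_YEAR = 365 * 24 * 60
HOURS_UPLOADED_PER_MINUTE = 500          # assumed YouTube upload rate
WORDS_PER_MINUTE_OF_SPEECH = 100         # assumed average speaking speed
TOKENS_PER_WORD = 1.25                   # rough word-to-token conversion

total_hours = YEARS * MINUTES_PER_YEAR * HOURS_UPLOADED_PER_MINUTE
total_words = total_hours * 60 * WORDS_PER_MINUTE_OF_SPEECH
total_tokens = total_words * TOKENS_PER_WORD

print(f"{total_words / 1e12:.1f} trillion words")    # ~15.8 trillion
print(f"{total_tokens / 1e12:.1f} trillion tokens")  # ~19.7 trillion
```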
This is based on the compute-optimal parameter-to-token ratio from the Chinchilla paper (roughly 20 training tokens per parameter).
Source: https://arxiv.org/abs/2203.15556
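A sketch of the sizing step, using the commonly cited ~20 tokens per parameter approximation of the Chinchilla compute-optimal ratio (the exact ratio in the paper varies with compute budget):

```python
# Chinchilla-style compute-optimal sizing: roughly 20 training tokens per parameter.
TOKENS = 15.7e12          # token budget estimated above
TOKENS_PER_PARAM = 20     # approximate compute-optimal ratio

optimal_params = TOKENS / TOKENS_PER_PARAM
print(f"~{optimal_params / 1e9:.0f} billion parameters")  # ~785 billion, in line with the ~770B figure
```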
Based on MosaicML's public cloud pricing, the current cost is about $2.5 million for a model trained on 1.4 trillion tokens. Scaling that to the token count above, the worst-case scenario would be a mere $35 million for a YouTube large language model.
Source: https://www.mosaicml.com/blog/gpt-3-quality-for-500k
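The $35 million figure comes from scaling the quoted price linearly with the number of training tokens; a sketch of that extrapolation, taking the $2.5M / 1.4T-token numbers above as given:

```python
# The post's cost extrapolation: scale the quoted price linearly with training tokens.
QUOTED_COST_USD = 2.5e6
QUOTED_TOKENS = 1.4e12
TARGET_TOKENS = 19.6e12   # worst-case token count from above

naive_cost = QUOTED_COST_USD * (TARGET_TOKENS / QUOTED_TOKENS)
print(f"${naive_cost / 1e6:.0f} million")  # $35 million
```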
Is it worth it?
Has Google already done it?
manOnPavementWaving t1_it393xz wrote
my man, you can't just scale cost with the number of tokens and not the number of parameters
way too many mostly false assumptions in these calculations
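A rough sketch of that objection, assuming the quoted $2.5M price was for a Chinchilla-style ~70B-parameter model (an assumption, since the reference model size isn't stated above) and the standard ~6·N·D estimate of transformer training FLOPs:

```python
# Training compute scales roughly as 6 * N * D FLOPs (N = parameters, D = training tokens),
# so cost grows with both, not with tokens alone.
# ASSUMPTION: the quoted $2.5M / 1.4T-token price was for a ~70B-parameter model.
REF_PARAMS, REF_TOKENS, REF_COST = 70e9, 1.4e12, 2.5e6
TARGET_PARAMS, TARGET_TOKENS = 770e9, 19.6e12

scale = (TARGET_PARAMS * TARGET_TOKENS) / (REF_PARAMS * REF_TOKENS)
print(f"compute scale factor: ~{scale:.0f}x")              # ~154x
print(f"implied cost: ~${REF_COST * scale / 1e6:.0f}M")    # ~$385M, not $35M
```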