manOnPavementWaving t1_itsn0zt wrote
It's actually already stopping: the engineering challenges are getting too big (trends predicted 5-10 trillion parameter dense models by now; bet your ass they don't exist), the available data is getting too scarce, and the other ways to increase performance are way too easy and way too cheap not to focus on.
porcenat_k t1_itsrnjb wrote
"trends predict 5-10 trillion parameter dense models by now, bet your ass they don't exist), the data available is getting too few".
I beg to differ. Indeed, we should expect to see 10 to 20 trillion parameter models this year. Based on industry movements, I'm expecting Meta or OpenAI to produce such a model by the end of this year, if not Q1 2023. We don't have enough data for Chinchilla compute-optimal models. DeepMind's scaling laws are flawed in a number of fundamental ways. One of them is that sample efficiency, generality, and intelligence increase with scale. Large vanilla models require less data in order to achieve better performance. We can train multi-trillion parameter dense models with the same or, better yet, less data than it took to train GPT-3. It is certainly possible to train such a model with massive compute clusters running on thousands of A100 GPUs, which is exactly what is being done right now. The cheap methods being focused on right now are a temporary crutch, which I project will be put away once firms are able to adopt new GPUs such as the H100.
manOnPavementWaving t1_itsz25o wrote
Wowowow, you're seriously questioning DeepMind's scaling laws and going back to the OpenAI ones, which have been demonstrated to be false?
Chain of thought prompting, self-consistency, reinforcement learning from human feedback, and data scaling are what have been driving LLM performance lately, noticeably more than scale has (whilst being significantly cheaper).
Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.
porcenat_k t1_itt4w3g wrote
>Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.
A combination of software and hardware improvements currently being worked on using Nvidia GPUs. https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/
With regard to Chinchilla, I don't think they disproved anything. See my comment history if you care enough. I've debated quite extensively on this topic.
manOnPavementWaving t1_itt8bt1 wrote
All I see is comparisons to humans that are by and large unfounded.
justowen4 t1_itt5mpf wrote
It’s simply going to be both scenarios in 2023, quantity and quality: synthetic data variations from existing corpuses with better training distributions (pseudo-sparsity) on optimized hardware. Maybe even some novel chips like photonic or analog later next year. It’s like CPUs 20 years ago, optimizations all around!
TFenrir t1_itsqltk wrote
I'm curious what trends have been predicting 5-10 trillion parameter models?
And additionally, more recent work has fundamentally increased the value of scaling.
https://twitter.com/YiTayML/status/1583514524836978689?t=Xxm_NYIQvGr5743ZdQzaqA&s=19
You can see that here for example.
But I have heard that finding data is the hard part now, and inference speeds on models in the trillions are going to restrict their capabilities, though there is a lot of great work being done on inference speedups.
porcenat_k t1_itssly1 wrote
The size of language models has been growing exponentially. We should expect 100 trillion parameter dense models by next year. https://i0.wp.com/silvertonconsulting.com/wp-content/uploads/2021/04/Screen-Shot-2021-04-15-at-3.18.31-PM.png?ssl=1
I think that is possible once firms begin using H100 GPUs.
manOnPavementWaving t1_itt06eo wrote
With H100, the training time optimistically only improves by a factor of 9. Not nearly enough to breach the 200x gap between the current largest model and a 100 trillion parameter model, and that's in parameter scaling alone, ignoring data scaling. PaLM training took 1200 hours on 6144 TPU v4 chips, and an additional 336 hours on 3072 TPU v4 chips. A 100 trillion parameter model would literally be too big to train before the year 2023 comes to an end.
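To make the arithmetic in this comment concrete, here is a rough back-of-envelope sketch. It only uses the figures quoted above (the PaLM hours and chip counts, the 200x parameter gap, the optimistic 9x speedup). The scaling rule is an assumption for illustration: training compute is taken to grow linearly with parameter count while the data and the number of chips stay fixed.

```python
# Back-of-envelope check using only the figures quoted in the comment above.
# Each entry is (wall-clock hours, number of TPU v4 chips) for one PaLM phase.
palm_phases = [(1200, 6144), (336, 3072)]

wall_hours = sum(hours for hours, _ in palm_phases)
chip_hours = sum(hours * chips for hours, chips in palm_phases)

param_gap = 200    # quoted gap between the largest current model and 100T parameters
h100_speedup = 9   # quoted optimistic speedup

# Assumption (not from the thread): compute scales linearly with parameters,
# with the dataset size and the chip count held fixed.
scaled_wall_hours = wall_hours * param_gap / h100_speedup
scaled_years = scaled_wall_hours / (24 * 365)

print(f"PaLM wall-clock hours: {wall_hours}")
print(f"PaLM chip-hours: {chip_hours}")
print(f"Estimated wall-clock years for 100T parameters: {scaled_years:.1f}")
```

Under these assumptions the run would take several years of wall-clock time on a cluster of the same size, before accounting for any increase in training data.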
porcenat_k t1_itt56ff wrote
100 billion parameter models seemed impossible too, back when the size of neural networks was a few million. I'm expecting 10 trillion parameters to be human level AGI.
manOnPavementWaving t1_itt6vrn wrote
That wasn't 1 year before the prediction of a hundred billion parameters, though. I'm not doubting that they'll come, I'm doubting the timeline.
Interested in why you think a 10 trillion parameter model would be human-level AGI.
porcenat_k t1_ituc77f wrote
Artificial neural networks are sufficient mathematical representations of biological cortices. There is a huge amount of evidence that this is the case. All that’s left to do is compare human and animal brains to our AI models. The human brain doesn’t use all 100 trillion parameters on any one task. In fact, the brain is divided into regions that allocate compute resources to vision, language, audio, etc. Not even half our brain devotes that many resources to one major region, so the upper bound would be 50 trillion parameters. 1 trillion is too small: there aren’t 100 different major cortical regions. There are 10, all working on the same architecture but processing different modalities. Conservatively, 10 trillion parameters are allocated to each major region. Let's take a language model with 10 trillion weights. At that capacity it should understand language completely. Then, having read all of PubMed, for example, it would be more knowledgeable than all medical professionals on the planet. A 100 trillion parameter model, I’ve calculated, would be more than a billion times more intelligent than the 10 trillion parameter one in terms of IQ, while also having the benefits of all human knowledge, never being tired, and being immortal.
manOnPavementWaving t1_itudq0y wrote
What study shows the equivalence of neural network parameters and connections in the brain? What calculations did you do to get to "a billion times more intelligent"?
porcenat_k t1_itxcyl3 wrote
https://ai.facebook.com/blog/studying-the-brain-to-build-ai-that-processes-language-as-people-do/
Here is a link to one of the most recent developments. There are plenty more.
>What calculations did you do to get to "a billion times more intelligent"?
That's a long discussion based on assumptions I find to be very reasonable. If you insist, I can go on at length. To simplify, see the empirical fact that the second most intelligent species, the chimpanzee, has a cortex just 3x smaller than a human's. The gap in intelligence resulting from such an increase is breathtaking. Indeed, quantity leads to vast qualitative leaps. Chimpanzees and gorillas trillions of years from now have no chance of inventing even the simplest tools. If 3x above chimpanzee is human intelligence, what is 10x above human?
manOnPavementWaving t1_ityolvz wrote
They actually do invent tools, but that's not the important thing. What made humans intelligent is having a big brain, and having lots of time. If we were to put a newborn and a baby chimpanzee in a jungle and monitor them, they wouldn't seem all that different regarding intelligence.
Fine if you take that into your calculations, but it can't be attributed to just the bigger brain. The problem is that the 100 trillion parameter model won't have hundreds of thousands of years and billions of copies of itself.
Cool reference, though! Interesting work.
[deleted] t1_itt05b3 wrote
[deleted]
SgathTriallair t1_ittiv7r wrote
Having difficulty getting the data or physically building the model doesn't mean that the accuracy gains from such a model are diminishing.
That is equivalent to asking how fast cars can theoretically go before they fall apart and being told that the speed limit is 65.
It may be difficult to build a 10 trillion parameter model but that doesn't mean it wouldn't be more effective.
TopicRepulsive7936 t1_itt5ek0 wrote
You didn't answer the question.
manOnPavementWaving t1_itt6ptg wrote
It's an implicit no, in the sense that scaling is already slowing.
ReasonablyBadass t1_ittrx0o wrote
No? There have been a lot of developments in getting results with smaller models, though. Basically, people figured out ways to not need to train such huge models. Which means the bigger models will now be even better. But the focus currently is figuring out how to get the most out of current sizes.
Qumeric t1_itywzd3 wrote
We don't have enough data and compute to make 5 trillion parameter models economically feasible. It just doesn't make sense. It's better to create a 500B model and train it properly.
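As a rough illustration of that trade-off, the sketch below compares the two model sizes mentioned here. Both constants are assumptions rather than figures from this thread: roughly 20 training tokens per parameter is a commonly cited reading of the Chinchilla result, and training compute is approximated as 6 x parameters x tokens. The helper names are hypothetical.

```python
# Assumed rule of thumb (not from the thread): a compute-optimal model
# wants roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(params: int) -> int:
    """Training tokens suggested by the assumed tokens-per-parameter ratio."""
    return params * TOKENS_PER_PARAM


def train_flops(params: int, tokens: int) -> int:
    """Common approximation of training compute: about 6 * N * D FLOPs."""
    return 6 * params * tokens


for params in (500 * 10**9, 5 * 10**12):
    tokens = compute_optimal_tokens(params)
    flops = train_flops(params, tokens)
    print(f"{params:.1e} params -> {tokens:.1e} tokens, {flops:.1e} FLOPs")
```

Under these assumptions, going from 500B to 5 trillion parameters needs 10x the tokens and about 100x the training compute, which is the kind of gap the comment is pointing at.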
manOnPavementWaving t1_itz2ds5 wrote
I agree, but you'll find yourself to be a stranger in this thread.