porcenat_k t1_itsrnjb wrote on October 26, 2022 at 1:00 AM

"trends predict 5-10 trillion parameter dense models by now, bet your ass they don't exist), the data available is getting too few".

I beg to differ. Indeed, we should expect to see 10 to 20 trillion parameter models this year. Based on industry movements, I'm expecting Meta or OpenAI to produce such a model by the end of this year, if not Q1 2023. We don't have enough data for Chinchilla compute-optimal models. DeepMind's scaling laws are flawed in a number of fundamental ways, one of which is that they ignore how sample efficiency, generality, and intelligence increase with scale. Large vanilla models require less data to achieve better performance. We can train multi-trillion-parameter dense models with the same or, better yet, less data than it took to train GPT-3. It is certainly possible to train such a model with massive compute clusters running on thousands of A100 GPUs, which is exactly what is being done right now. The cheap methods being focused on right now are a temporary crutch which I project will be put away once firms are able to adopt new GPUs such as the H100.

manOnPavementWaving t1_itsz25o wrote on October 26, 2022 at 1:57 AM

Wowowow, you're seriously questioning DeepMind's scaling laws and going back to the OpenAI ones, which have been demonstrated to be false?

Chain-of-thought prompting, self-consistency, reinforcement learning from human feedback, and data scaling are what have been driving LLM performance lately, noticeably more than scale has (whilst being significantly cheaper).

Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.

porcenat_k t1_itt4w3g wrote on October 26, 2022 at 2:44 AM

>Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.

A combination of software and hardware improvements being currently worked on using Nvidia GPUs. https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/

With regard to Chinchilla, I don't think they disproved anything. See my comment history if you care enough. I've debated quite extensively on this topic.

manOnPavementWaving t1_itt8bt1 wrote on October 26, 2022 at 3:14 AM

All I see is comparisons to humans that are by and large unfounded.

justowen4 t1_itt5mpf wrote on October 26, 2022 at 2:51 AM

It’s simply going to be both scenarios in 2023, quantity and quality: synthetic data variations from existing corpuses with better training distributions (pseudo-sparsity) on optimized hardware. Maybe even some novel chips, like photonic or analog, later next year. It’s like CPUs 20 years ago, optimizations all around!

TFenrir t1_itsqltk wrote on October 26, 2022 at 12:52 AM

I'm curious what trends have been predicting 5-10 trillion parameter models?

And additionally, more recent work has fundamentally increased the value of scaling.

https://twitter.com/YiTayML/status/1583514524836978689?t=Xxm_NYIQvGr5743ZdQzaqA&s=19

You can see that here for example.

But I have heard that finding data is the hard part now, and inference speeds on models in the trillions are going to restrict their capabilities. There is a lot of great work being done on inference speedups, though.

porcenat_k t1_itssly1 wrote on October 26, 2022 at 1:07 AM

The size of language models has been growing exponentially. We should expect 100 trillion parameter dense models by next year. https://i0.wp.com/silvertonconsulting.com/wp-content/uploads/2021/04/Screen-Shot-2021-04-15-at-3.18.31-PM.png?ssl=1

I think that is possible once firms begin using H100 GPUs.

manOnPavementWaving t1_itt06eo wrote on October 26, 2022 at 2:06 AM

With the H100, the training time optimistically only improves by a factor of 9. That is not nearly enough to breach the 200x gap between the current largest model and a 100 trillion parameter model, and that's in parameter scaling alone, ignoring data scaling. PaLM training took 1200 hours on 6144 TPU v4 chips, and an additional 336 hours on 3072 TPU v4 chips. A 100 trillion parameter model would literally be too big to train before the year 2023 comes to an end.
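For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. It uses only the figures quoted in this comment. Everything else is an assumption made for illustration: that training compute grows linearly with parameter count when the token count is held fixed, that the 9x speedup can simply be divided out of the PaLM chip-hours, and that the run uses the same 6144 chips. The function names are hypothetical.

```python
# Back-of-envelope check of the figures quoted above.
# The scaling rules are simplifying assumptions, not measurements.

PALM_RUNS = [(1200, 6144), (336, 3072)]  # (hours, TPU v4 chips), as quoted
PARAM_GAP = 200    # quoted gap to a 100 trillion parameter model
SPEEDUP = 9        # quoted optimistic H100 speedup


def chip_hours(runs):
    """Total chip-hours summed over all training phases."""
    return sum(hours * chips for hours, chips in runs)


def scaled_wall_clock_hours(runs, param_gap, speedup, chips):
    """Wall-clock hours for the scaled-up run on `chips` accelerators.

    Assumes compute grows linearly with parameter count at a fixed
    token count, and that the speedup divides chip-hours evenly.
    """
    return chip_hours(runs) * param_gap / speedup / chips


if __name__ == "__main__":
    base = chip_hours(PALM_RUNS)
    hours = scaled_wall_clock_hours(PALM_RUNS, PARAM_GAP, SPEEDUP, chips=6144)
    print(f"PaLM baseline: {base:,} chip-hours")
    print(f"Scaled run: {hours:,.0f} hours (~{hours / 24 / 365:.1f} years)")
```

Under those assumptions the PaLM baseline comes to about 8.4 million chip-hours, and the scaled run to about 30,400 hours, or roughly 3.5 years of wall-clock time, before any extra data is added.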

porcenat_k t1_itt56ff wrote on October 26, 2022 at 2:47 AM

100 billion parameter models seemed impossible too, back when the size of neural networks was a few million. I'm expecting 10 trillion parameters to be human level AGI.

manOnPavementWaving t1_itt6vrn wrote on October 26, 2022 at 3:01 AM

That wasn't 1 year before the prediction of a hundred billion parameters though. I'm not doubting that they'll come, I'm doubting the timeline.

Interested in why you think a 10 trillion parameter would be human level AGI.

porcenat_k t1_ituc77f wrote on October 26, 2022 at 11:28 AM

Artificial neural networks are sufficient mathematical representations of biological cortices. There is a huge amount of evidence that this is the case. All that’s left to do is compare human and animal brains to our AI models. The human brain doesn’t use all 100 trillion parameters on any one task. In fact, the brain is divided into regions that allocate compute resources to vision, language, audio, etc. Not even half our brain is devoted to any one major region, so the upper bound would be 50 trillion parameters. 1 trillion is too small. There aren’t 100 different major cortical regions. There are 10, all working on the same architecture but processing different modalities. Conservatively, 10 trillion parameters are allocated to each major region. Let’s take a language model with 10 trillion weights. At that capacity it should understand language completely. Then, having read all of PubMed for example, it would be more knowledgeable than all medical professionals on the planet. A 100 trillion parameter model, I’ve calculated, would be more than a billion times more intelligent than the 10 trillion parameter one in terms of IQ, while also having the benefit of all human knowledge, never being tired, and being immortal.

manOnPavementWaving t1_itudq0y wrote on October 26, 2022 at 11:44 AM

What study shows the equivalence of neural network parameters and connections in the brain? What calculations did you do to get to "a billion times more intelligent"?

porcenat_k t1_itxcyl3 wrote on October 27, 2022 at 12:09 AM

https://ai.facebook.com/blog/studying-the-brain-to-build-ai-that-processes-language-as-people-do/

Here is a link to one of the most recent developments. There are plenty more.

>What calculations did you do to get to "a billion times more intelligent"?

That's a long discussion based on assumptions I find to be very reasonable. If you insist, I can go into it at length. To simplify, consider the empirical fact that the second most intelligent species, the chimpanzee, has a cortex just 3x smaller than a human's. The gap in intelligence resulting from such an increase is breathtaking. Indeed, quantity leads to vast qualitative leaps. Chimpanzees and gorillas have no chance of inventing even the simplest tools, even trillions of years from now. If 3x above chimpanzee is human intelligence, what is 10x above human?

manOnPavementWaving t1_ityolvz wrote on October 27, 2022 at 7:50 AM

They actually do invent tools, but that's not the important thing. What made humans intelligent is having a big brain, and having lots of time. If we were to put a newborn and a baby chimpanzee in a jungle and monitor them, they wouldn't seem all that different regarding intelligence.

Fine if you take that into your calculations, but it can't be attributed to just the bigger brain. Problem being, the 100 trillion parameter model won't have hundreds of thousands of years, and billions of copies of itself.

Cool reference, though! Interesting work

[deleted] t1_itt05b3 wrote on October 26, 2022 at 2:06 AM

[deleted]

SgathTriallair t1_ittiv7r wrote on October 26, 2022 at 4:57 AM

Having difficulty getting the data or physically building the model doesn't mean that the accuracy gains from such a model are diminishing.

That is equivalent to asking how fast cars can theoretically go before they fall apart and being told that the speed limit is 65.

It may be difficult to build a 10 trillion parameter model but that doesn't mean it wouldn't be more effective.

TopicRepulsive7936 t1_itt5ek0 wrote on October 26, 2022 at 2:49 AM

You didn't answer the question.

manOnPavementWaving t1_itt6ptg wrote on October 26, 2022 at 3:00 AM

It's an implicit no, in the sense that scaling is already slowing.

ReasonablyBadass t1_ittrx0o wrote on October 26, 2022 at 6:50 AM

No? There have been a lot of developments in getting results with smaller models though. Basically, people figured out ways to not need to train such huge models, which means the bigger models will now be even better. But the focus currently is figuring out how to get the most out of current sizes.

Qumeric t1_itywzd3 wrote on October 27, 2022 at 9:50 AM

We don't have enough data and compute to make 5 trillion parameter models economically feasible. It just doesn't make sense. It's better to create a 500B model and train it properly.
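To put rough numbers on that, here is a minimal Python sketch comparing the two sizes. It rests on two assumptions that are not from this thread: the commonly cited Chinchilla-style rule of thumb of about 20 training tokens per parameter, and the usual approximation that training compute is about 6 x parameters x tokens FLOPs. The function names are hypothetical and the results are order-of-magnitude estimates only.

```python
# Rough data and compute budgets for "properly trained" dense models.
# Both constants are assumed rules of thumb, not exact laws.

TOKENS_PER_PARAM = 20        # assumed Chinchilla-style ratio
FLOPS_PER_PARAM_TOKEN = 6    # assumed C ~ 6 * N * D approximation


def optimal_tokens(params):
    """Training tokens suggested by the assumed tokens-per-parameter ratio."""
    return params * TOKENS_PER_PARAM


def training_flops(params, tokens):
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


if __name__ == "__main__":
    for name, params in [("500B", 500 * 10**9), ("5T", 5 * 10**12)]:
        tokens = optimal_tokens(params)
        flops = training_flops(params, tokens)
        print(f"{name}: {tokens:.1e} tokens, {flops:.1e} FLOPs")
```

Under these assumptions the 5 trillion parameter model wants about 100 trillion tokens and about 100x the training compute of the 500B model, because both the parameter count and the token count grow by 10x.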

manOnPavementWaving t1_itz2ds5 wrote on October 27, 2022 at 10:59 AM

I agree, but you'll find yourself to be a stranger in this thread.

At what point does increasing a model's parameter count stop improving its accuracy? Is AGI possible by just scaling models with the current transformer architecture?

manOnPavementWaving t1_itsn0zt wrote on October 26, 2022 at 12:25 AM