Submitted by mrx-ai t3_121q6nk in MachineLearning
Comments
currentscurrents t1_jdn0opn wrote
The Nvidia H100 marketing material does advertise a configuration for linking 256 of them to train trillion-parameter language models:
>With NVIDIA NVLink® Switch System, up to 256 H100 GPUs can be connected to accelerate exascale workloads. The GPU also includes a dedicated Transformer Engine to solve trillion-parameter language models.
That doesn't necessarily mean GPT-4 is that big, but it's possible. Microsoft and Nvidia have been working closely to build the new Azure GPU cloud.
Deep-Station-1746 t1_jdn3vxg wrote
If you say so.
pornthrowaway42069l t1_jdn6noe wrote
Not going to deny that GPT-4 looks impressive, but they could set up 10 bajillion-quadrillion parameters; the question is, do they have the data to effectively utilize all of them? Maybe it's time to start looking into decreasing the number of parameters and making more efficient use of the data.
currentscurrents t1_jdn7spo wrote
Bigger models are more sample-efficient for a given amount of data.
Scale is a triangle of three factors: model size, data size, and compute. If you want to make more efficient use of data, you need to increase the other two.
In practice, LLMs are not data-limited right now; they're limited by compute and model size, which is why you see models like LLaMA that throw huge amounts of data at smaller models.
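A rough way to see that triangle is the Chinchilla-style rule of thumb of about 20 training tokens per parameter and roughly 6 FLOPs per parameter per token. The constants and model sizes below are illustrative assumptions, not figures for GPT-4 or any specific model; a minimal Python sketch:

```python
# Back-of-envelope sketch of the model-size / data / compute triangle,
# using the Chinchilla-style rule of thumb of ~20 training tokens per
# parameter and ~6 FLOPs per parameter per token. All constants and
# model sizes here are illustrative assumptions, not real figures.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough compute-optimal training-token count for a given parameter count."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


for n_params in (7e9, 65e9, 1e12):  # 7B, 65B, and a hypothetical 1T dense model
    tokens = compute_optimal_tokens(n_params)
    flops = training_flops(n_params, tokens)
    print(f"{n_params / 1e9:6.0f}B params -> ~{tokens / 1e12:5.1f}T tokens, ~{flops:.1e} training FLOPs")
```

Under these assumptions, a hypothetical 1T-parameter dense model would want on the order of 20T training tokens to be compute-optimal, which is part of why data availability comes up in these discussions at all.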
pornthrowaway42069l t1_jdnmf0j wrote
I'm confused; how is that different from what I said? Maybe I worded my response poorly, but I meant that we should focus on smaller models rather than those gigantic ones.
frequenttimetraveler t1_jdo9gw5 wrote
Altman did not say anything about that on the Lex Fridman show. He said the 100T rumor was just a meme.
How would runtime scale with parameter size? Can we infer whether the 1T figure is true from the latency of the responses?
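One very rough way to reason about that question: at batch size 1, a dense decoder is usually memory-bandwidth bound, so per-token latency scales roughly linearly with parameter count because every generated token has to stream all the weights from HBM once. The hardware figures below are assumed purely for illustration (not a claim about OpenAI's serving setup, which may use sparsity, quantization, and batching that change the math):

```python
# Back-of-envelope single-stream decoding estimate: assume each generated
# token streams all model weights from HBM once. The bandwidth figure is an
# assumed aggregate for a multi-GPU node (e.g. 8 accelerators at ~3 TB/s),
# purely for illustration; real serving stacks differ substantially.

def tokens_per_second(n_params: float, bytes_per_param: float, agg_bandwidth_bytes_per_s: float) -> float:
    """Approximate bandwidth-bound decoding speed for a dense model."""
    model_bytes = n_params * bytes_per_param
    return agg_bandwidth_bytes_per_s / model_bytes


agg_bandwidth = 8 * 3e12  # assumed: 8 GPUs x ~3 TB/s of HBM bandwidth each

for n_params in (175e9, 1e12):  # GPT-3-sized vs a hypothetical 1T dense model
    tps = tokens_per_second(n_params, bytes_per_param=2, agg_bandwidth_bytes_per_s=agg_bandwidth)
    print(f"{n_params / 1e9:.0f}B params (fp16) -> ~{tps:.0f} tokens/s per stream")
```

Under these assumptions a 1T-parameter dense model would decode several times slower per stream than a 175B one, but observed API latency alone can't confirm the parameter count, since batching, quantization, and mixture-of-experts-style sparsity all break the simple linear relationship.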
lanky_cowriter t1_jdoeyi4 wrote
Sam talked about this on the Lex Fridman podcast; it's not true.
Fit-Recognition9795 t1_jdmwd4g wrote
It is not