Submitted by Zondartul t3_zrbfcr in MachineLearning

I'm trying to figure out how to run something like GPT-J, FLAN-T5, etc. on my PC, without using cloud compute services (for privacy and other reasons). However, GPT-J-6B needs either ~14 GB of VRAM or about 4x as much plain RAM.

Upgrading my PC to 48 GB of RAM is possible, and 16 or 24 GB graphics cards are available to the general public (though they cost as much as a car), but anything beyond that is in the realm of HPC, datacenter hardware and "GPU accelerators". 128 GB GPUs exist out there somewhere, but the distributors don't even list a price; it's all "get a quote" and "contact us", meaning it's super expensive and you need to be the CEO of a medium-sized company for them to even talk to you.

I'm trying to figure out if it's possible to run the larger models (e.g. 175B GPT-3 equivalents) on consumer hardware, perhaps by doing a very slow emulation using one or several PCs such that their collective RAM (or SSD swap space) matches the VRAM those beasts need.

So the question is "will it run super slowly" or "will it fail immediately due to completely incompatible software / being impossible to configure for anything other than real datacenter hardware"?

86

Comments


CKtalon t1_j13dg5b wrote

Just forget about it.

Yes, it's possible to do it on CPU/RAM (Threadripper builds with >256GB RAM plus some assortment of 2x-4x GPUs), but the speed is so slow that it's pointless to work with. DeepSpeed or Hugging Face Accelerate can spread the model across GPU and CPU, but even so, it will be stupid slow, probably MINUTES per token.
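For reference, the Hugging Face route is roughly the sketch below (a minimal example, not a recipe; the checkpoint name and memory caps are placeholders, and generation will crawl once layers spill into CPU RAM):

```python
# Sketch: spreading a causal LM across one GPU and CPU RAM with Hugging Face
# Accelerate's device_map. Assumes transformers + accelerate are installed;
# the checkpoint and memory caps below are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # swap in whatever model you actually want to test

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                        # place layers on GPU first, then CPU
    max_memory={0: "14GiB", "cpu": "48GiB"},  # per-device caps; overflow goes to CPU RAM
    torch_dtype="auto",
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```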

We are at least 5 years away from consumer hardware being able to run 175B+ models on a single machine (4 GPUs in a single box).

20B models are in the realm of consumer hardware (3090/4090) with INT8: slow, but still possible.

73

caedin8 t1_j147bx3 wrote

Is this just training? What about inference? How does ChatGPT serve millions of people so quickly if it needs such enterprise hardware per request?

22

artsybashev t1_j154fhy wrote

That's just inference. Training requires more like 100x A100s and a cluster to train on. Just a million dollars to get started.

19

AltruisticNight8314 t1_j1ohh7u wrote

What hardware would be required to i) train or ii) fine-tune weights (i.e. run a few epochs on my own data) for medium-sized transformers (500M-15B parameters)?

I do research in proteomics and I have a very specific problem where even just fine-tuning the weights of a pretrained transformer (such as ESM-2) might work well.

Of course, there's always the poor man's alternative of building a supervised model on the embeddings returned by the encoder.
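(That route can be prototyped on modest hardware; a minimal sketch below, where the ESM-2 checkpoint name and the mean-pooling choice are just illustrative assumptions.)

```python
# Sketch: extract per-sequence embeddings from a pretrained ESM-2 encoder and
# hand them to a small supervised model. Checkpoint, pooling, and downstream
# classifier are assumptions for illustration, not a recommendation.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t12_35M_UR50D"  # a small ESM-2 variant; larger ones exist
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MGSSHHHHHHSSGLVPRGSH"]

with torch.no_grad():
    batch = tokenizer(sequences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding when pooling
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean-pool over residues

# embeddings (batch, dim) can now feed scikit-learn, a small MLP, etc.
print(embeddings.shape)
```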

1

artsybashev t1_j1ph7f3 wrote

One A100 80GB will get you started with models in the 500M-15B range. You can rent one for about $50 per day. See where that takes you in a week.

2

Misaiato t1_j14pagb wrote

MSFT Azure. It has unlimited resources available to it.

9

gBoostedMachinations t1_j155zas wrote

Training is what takes so much computation in almost all cases. Once the model itself is trained, only a tiny fraction of that compute is needed. Most trained ML models that ship today can generate predictions on a Raspberry Pi or a cell phone. LLMs still require more hardware for inference, but you'd be surprised how little they need compared to what's needed for training.

8

calv420 t1_j15ytb1 wrote

Don't see why you're getting downvoted; inference requires significantly less compute than training.

5

gBoostedMachinations t1_j16pzea wrote

If there's one thing I've learned about Reddit, it's that you can make the most uncontroversial comment of the year and still get downvoted. I mean, I got banned from r/coronavirus for pointing out that people who recover from covid probably have at least a little tiny bit of immunity to re-infection.

After covid, I’ve learned to completely ignore my comment scores when it comes to feedback on Reddit. The only way to know if one of my comments is valued is to read the replies.

7

CKtalon t1_j16qtog wrote

Training will need at minimum about 10x more resources than what I said (which was for inference). And that's just to fit the model and all its optimizer states with batch size 1.

2

visarga t1_j14bnb7 wrote

GLM-130B runs on 4x 3090, uses INT4.

6

gBoostedMachinations t1_j155nsu wrote

It's kind of scary to think how soon the tech will let randos make LLMs. Sure, at first expertise will be needed, but as we've seen before, it's only a matter of time before the tools for the average Joe to train a model are made available.

Jfc shit is getting weird

4

recidivistic_shitped t1_j136lsh wrote

GPT-J-6B can load in under 8GB of VRAM with LLM.int8(). For the same reason, you can also run it in Colab nowadays.

175B.... Really bad idea to offload it to normal RAM. Inference is more limited by FLOPS than memory at that scale. OpenAI's API is cheap enough unless you're scaling to a substantial userbase.
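For the 8-bit route, the load looks roughly like the sketch below (assuming bitsandbytes is installed and a late-2022 transformers release; the exact flags may differ in other versions):

```python
# Sketch: loading GPT-J-6B with LLM.int8() quantization so the weights fit in
# roughly 8GB of VRAM. Requires a CUDA GPU plus the bitsandbytes package.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",
    load_in_8bit=True,   # store weights in int8, run matmuls via LLM.int8()
)

inputs = tokenizer("Large language models are", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```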

28

GoofAckYoorsElf t1_j12t661 wrote

A small car. I just bought a new 3090Ti with 24GB VRAM for as little as 1300€. I don't find that overly expensive.

22

yashdes t1_j1420uo wrote

He's probably referring to Quadros; those things are stupid expensive even in comparison to the 3090/4090.

8

GoofAckYoorsElf t1_j14bxe6 wrote

True, but who needs a Quadro, if a 3090Ti is entirely sufficient?

6

yashdes t1_j14h1mm wrote

100% agree, love my 3090s, but hope they keep coming down in price so I can get more :D

3

BelialSirchade t1_j174112 wrote

More VRAM, probably, but you can just hook up two 3090 Tis at half the price.

Though for an LLM you'd probably need ten 3090 Tis, and even then it's probably not enough.

1

arg_max t1_j136nbo wrote

CPU implementations are going to be very slow. I'd probably try renting an A100 VM, running some experiments, and measuring VRAM and RAM usage. But I'd be surprised if anything below a 24GB 3090 Ti would do the job. The issue is that going bigger than 24GB means stepping up to an A6000, which costs as much as four 3090s.
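The measuring part is cheap to script; a small sketch (it only captures peak GPU memory as PyTorch sees it, not host RAM or allocator fragmentation):

```python
# Sketch: report the peak VRAM PyTorch allocated while running one call,
# e.g. a single generate() pass on a rented A100.
import torch

def report_peak_vram(fn, *args, **kwargs):
    """Run fn once and print the peak GPU memory allocated by PyTorch."""
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"peak VRAM: {peak_gib:.2f} GiB")
    return result

# usage: report_peak_vram(model.generate, **inputs, max_new_tokens=20)
```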

18

arg_max t1_j136y5q wrote

Just to give you an idea of the "optimal configuration" though, this is way beyond desktop PC levels. From the Alpa docs: "You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory."

https://alpa.ai/tutorials/opt_serving.html

9

sayoonarachu t1_j1408am wrote

If you're savvy enough, you can technically run BLOOM 176B. But as others stated, it'll take forever to be usable, i.e. 30 minutes for 10 tokens.

https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32
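The article walks the BLOOM shards block by block from disk by hand; a related (and similarly slow) route is Accelerate's disk offload, roughly as in this sketch (the paths and memory caps are placeholders):

```python
# Sketch: let Accelerate spill whatever fits in neither VRAM nor RAM onto an
# SSD. Expect minutes per token for a 176B model; all values are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "120GiB"},  # whatever your machine actually has
    offload_folder="/mnt/ssd/bloom_offload",   # remaining layers get memory-mapped here
    torch_dtype="auto",
)
```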

11

avialex t1_j14p22o wrote

There's a VRAM memory leak in that code, btw. I haven't tracked it down yet, but it's easy to work around by clearing the torch CUDA cache in the forward method.
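The workaround is roughly the wrapper below (a sketch; `block` is a stand-in for whichever module in that article's code leaks, which hasn't been pinned down):

```python
# Sketch: a blunt workaround for VRAM creeping upward between blocks, clearing
# PyTorch's cached CUDA allocations after every forward pass.
import torch
from torch import nn

class CacheClearingWrapper(nn.Module):
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, *args, **kwargs):
        out = self.block(*args, **kwargs)
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the CUDA driver
        return out
```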

5

suflaj t1_j13pqhe wrote

While you can run large models (layer by layer, batch by batch, dimension by dimension, or even element by element), the problem is getting at the weights. No one says you need to transform your input into the output in one go; all that matters is that no single operation makes you go OOM.

Theoretically, there is no network where a single linear combination would exceed modern memory sizes, but that doesn't mean such a strategy would be fast. At the base level, all you need is 3 registers (2 for the operands of the multiply-add, 1 to keep the running sum) and enough storage to hold the network weights.
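To make the "three registers" point concrete, here's a toy sketch of a linear layer evaluated one weight at a time, with the weights memory-mapped from disk so almost nothing is held in RAM at once (illustrative only, and obviously very slow):

```python
# Toy sketch: y = W @ x computed element by element with a constant-size
# working set. The weight matrix stays on disk and is paged in on demand,
# so only the accumulator and the two operands of each multiply-add are live.
import numpy as np

def streamed_matvec(weight_path, shape, x):
    W = np.load(weight_path, mmap_mode="r")      # weights stay on disk
    out = np.zeros(shape[0], dtype=np.float32)
    for i in range(shape[0]):
        acc = 0.0                                # the running-sum "register"
        for j in range(shape[1]):
            acc += float(W[i, j]) * float(x[j])  # two operands + accumulator
        out[i] = acc
    return out

# usage, with a fake weight file standing in for a real checkpoint shard:
np.save("w.npy", np.random.randn(16, 64).astype(np.float32))
x = np.random.randn(64).astype(np.float32)
print(streamed_matvec("w.npy", (16, 64), x))
```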

6

LetterRip t1_j12uqxv wrote

With DeepSpeed you can map the weights to an SSD. Very slow, but possible.
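That's DeepSpeed's ZeRO-Infinity NVMe offload; the config looks roughly like the sketch below (paths and sizes are placeholders, and the exact keys should be checked against the DeepSpeed docs for your version):

```python
# Sketch of a DeepSpeed ZeRO stage-3 config that offloads parameters to NVMe.
# All values are placeholders; consult the DeepSpeed documentation.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # mount point of a fast SSD
            "pin_memory": True,
        },
    },
    "fp16": {"enabled": True},
}
# passed to deepspeed.initialize(model=model, config=ds_config, ...)
```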

5

wywywywy t1_j151o6u wrote

You could run a cut-down version of such models. I managed to run inference on OPT 2.7B, GPT-Neo 2.7B, etc. on my 8GB GPU.

Now that I've upgraded to a used 3090, I can run OPT 6.7B, GPT-J 6B, etc.

5

Final-Rush759 t1_j12zqjw wrote

Model parallelism. But you need more than one card. Buy an A6000, which has 48 GB of VRAM.

4

limapedro t1_j13qfxr wrote

The cheaper option would be to run on 2 RTX 3060s! Each GPU costs about 300 USD, so you could buy two for 600ish. There's also a 16 GB A770 from Intel. To run a very large model you could split the weights into so-called blocks; I was able to test this myself in a simple Keras implementation, but the conversion code is hard to write, although I think I've seen something similar from Hugging Face.
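The Hugging Face thing is probably device_map sharding; splitting one model across two cards can look roughly like this sketch (the checkpoint and per-GPU caps are placeholders):

```python
# Sketch: shard one model's layers across two 12GB GPUs with Accelerate.
# Checkpoint and per-device memory caps are illustrative only.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    device_map="auto",                    # Accelerate assigns blocks of layers per device
    max_memory={0: "11GiB", 1: "11GiB"},  # leave headroom on each 12GB card
    torch_dtype="auto",
)
```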

4

maizeq t1_j162xtk wrote

How is the tooling and performance for the A770 on machine learning workloads? Do you have any experience with it?

1

limapedro t1_j175nby wrote

No, I haven't! In theory it should be really good, though. You can already run deep learning through DirectML, but a native implementation should be really fast because of its XMX cores, which are similar to Tensor Cores.

1

wywywywy t1_j18a6g2 wrote

I haven't tried it myself, but Intel has their own distribution of Python and their own PyTorch extension. They seem to be quite usable judging from some of the GitHub comments.

1

SpaceCockatoo t1_j12q9me wrote

I too would like to know if this is even theoretically possible

3

head_robotics t1_j14e1fn wrote

Another question could be: what is the smallest language model that would still be useful?
If the largest models can't reasonably be run, what about the smaller ones that can?
Any chance of getting usable results at a reasonable speed?

3

caninerosie t1_j12qgv6 wrote

There are a ton of consumer motherboards that support up to 128GB of RAM. A single 3090 also has 24GB of GDDR6X memory. If you need more than that, you can NVLink another 3090, with the added benefit of speeding up training. That's getting pretty pricey though.

Other than that, there's the M1 Ultra Mac Studio? It won't be as fast as training on a dedicated GPU, but you'll have the memory for it and faster throughput than normal DRAM.

Edit: for an extremely large model like GPT-3 you would need almost 400 GB of RAM. Theoretically you could build multiple machines with NVLinked 3090/4090s, all networked together for distributed training.
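For the "networked together" part, the usual starting point is PyTorch DDP launched with torchrun; a minimal sketch (model, data, and addresses are placeholders, and note that plain DDP still needs the whole model to fit on each GPU, so 175B-class models would additionally need pipeline or ZeRO-style sharding):

```python
# Sketch: multi-machine data-parallel training with torch.distributed / DDP.
# Launch on every node with something like:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --master_addr=192.168.1.10 --master_port=29500 train.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")         # rank/world size come from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                             # placeholder training loop
        x = torch.randn(8, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                             # gradients all-reduced across machines
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```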

2

DavesEmployee t1_j1426ms wrote

4090s don’t support NVLink unfortunately 🥲

2

BelialSirchade t1_j174fni wrote

You don't need NVLink though. PyTorch supports model parallelism through DeepSpeed anyway, so go ahead and buy that extra 4090.

2

caninerosie t1_j14738h wrote

really? NVIDIA is so weird

1

DavesEmployee t1_j147fki wrote

I think it's because they're mostly used for games, which almost never take advantage of the technology. You can tell from the designs that they were going to support it, but the feature was taken out, probably due to price or power concerns.

2