Submitted by poppear t3_11ozl85 in MachineLearning
I put together this plain PyTorch implementation of LLaMA (I just substituted the fairscale layers with the native ones and converted the weights accordingly) so it can be run more easily in different environments.
The big problem with the official implementation is that it hard-codes the GPU count: the 65B model requires 8 GPUs no matter what, the 30B model requires 4, and so on. In reality you can easily fit the 65B model in two A100s with 100 GB of VRAM.
vanilla-llama solves this problem. As long as you have enough total memory, the model will be loaded across all the available GPUs.
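The core idea (fit the model into whatever GPUs are present instead of a fixed count) can be sketched as a greedy layer-to-device assignment. This is a hypothetical illustration of the technique, not vanilla-llama's actual code; `plan_device_map` and its arguments are made-up names for the example:

```python
def plan_device_map(layer_sizes_gb, gpu_capacities_gb):
    """Assign each layer to a GPU in order, spilling to the next GPU
    once the current one is full.

    Hypothetical helper for illustration only -- not vanilla-llama's API.
    layer_sizes_gb: per-layer weight size in GB.
    gpu_capacities_gb: usable VRAM per GPU in GB.
    Returns {layer_index: gpu_index}.
    """
    device_map = {}
    gpu = 0      # GPU currently being filled
    used = 0.0   # GB already placed on that GPU
    for i, size in enumerate(layer_sizes_gb):
        # Move to the next GPU if this layer would overflow the current one.
        if gpu < len(gpu_capacities_gb) and used + size > gpu_capacities_gb[gpu]:
            gpu += 1
            used = 0.0
        if gpu >= len(gpu_capacities_gb):
            raise MemoryError("not enough total GPU memory for the model")
        device_map[i] = gpu
        used += size
    return device_map


# Example: 10 layers of 2 GB each spread over three 8 GB GPUs.
print(plan_device_map([2.0] * 10, [8.0, 8.0, 8.0]))
```

With a map like this, each layer's weights are moved to their assigned device at load time, and activations are transferred between devices at the GPU boundaries during the forward pass (this is the same general strategy libraries like Hugging Face Accelerate use with `device_map="auto"`).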
PuzzledWhereas991 t1_jbvcei5 wrote
Which model can I run with two 3060 Tis (8 GB) and one 3080 Ti (12 GB)?