megacewl t1_jbvuksj wrote
Reply to comment by PuzzledWhereas991 in [P] vanilla-llama an hackable plain-pytorch implementation of LLaMA that can be run on any system (if you have enough resources) by poppear
Not sure about vanilla-llama, but at the moment you can run LLaMA-13B at 4-bit on any GPU with a bit more than 10GB of VRAM, so your 3080 Ti can run it.
To run 30B at 4-bit, you need at least 20GB of VRAM. If your motherboard supports SLI, you can use NVLink to share the VRAM between your two GPUs and get a collective 20GB, which would let you run the 30B model provided you have enough system RAM.
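For a rough sense of where those numbers come from, here's a back-of-the-envelope estimate (the overhead figure is just an assumption; actual usage depends on context length and the implementation):

```python
# Rough VRAM estimate for 4-bit quantized weights: 0.5 bytes per parameter,
# plus a guessed fixed overhead for activations / KV cache (assumption).
def est_vram_gib(params_billion, bits=4, overhead_gib=2.0):
    weights_gib = params_billion * 1e9 * (bits / 8) / 1024**3
    return weights_gib + overhead_gib

print(f"13B @ 4-bit: ~{est_vram_gib(13):.1f} GiB")  # ~8 GiB, fits a 12GB 3080 Ti
print(f"30B @ 4-bit: ~{est_vram_gib(30):.1f} GiB")  # ~16 GiB, hence ~20GB with headroom
```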
Not sure if I can post the link to the tutorial here, but Google "rentry Llama v2" and click the "LLaMA Int8 4bit ChatBot Guide v2" result for the most up-to-date tutorial on running it.
megacewl t1_jbzts4h wrote
Reply to comment by ID4gotten in [P] vanilla-llama an hackable plain-pytorch implementation of LLaMA that can be run on any system (if you have enough resources) by poppear
I think so? Converting the model to 8-bit or 4-bit quantizes the weights to lower precision, so the model literally gets smaller (and, surprisingly, this barely changes the output quality at all), which is why it requires less VRAM to load.
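As a toy sketch of what that conversion does (made-up block size and scaling here, not the exact GPTQ/GGML scheme):

```python
import numpy as np

# Store weights as 4-bit signed ints plus one float scale, instead of
# 16 bits per weight -> roughly 4x smaller, with only a small rounding error.
w = np.random.randn(4096).astype(np.float32)      # pretend these are layer weights
scale = np.abs(w).max() / 7                        # map into the signed 4-bit range [-7, 7]
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
w_restored = q.astype(np.float32) * scale          # dequantized at inference time

print("max abs rounding error:", np.abs(w - w_restored).max())
```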
There are tutorials for setting up llama.cpp with 4-bit-converted LLaMA models that may be worth checking out to help you achieve your goal. llama.cpp is a C++ implementation of LLaMA that runs on the CPU and system RAM instead of the GPU. Someone even got the 7B model running on a 4GB Raspberry Pi 4, so it's a good option if you're low on VRAM.
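If you go the llama.cpp route, a run looks roughly like this (the model path, filename, and flags here are assumptions based on the llama.cpp README; check it for the exact conversion and quantization steps):

```python
import subprocess

# Hypothetical invocation of a locally built llama.cpp binary with a
# 4-bit-converted 7B model; paths and filenames are assumptions.
subprocess.run([
    "./main",
    "-m", "models/7B/ggml-model-q4_0.bin",  # 4-bit GGML weights
    "-p", "Building a website can be done in 10 simple steps:",  # prompt
    "-n", "128",  # number of tokens to generate
    "-t", "8",    # CPU threads
])
```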