ID4gotten t1_jbwz98k wrote on March 12, 2023 at 10:45 AM

Reply to comment by megacewl in [P] vanilla-llama an hackable plain-pytorch implementation of LLaMA that can be run on any system (if you have enough resources) by poppear

Could the 4-bit (edit: or 8-bit) version of the 65B model parameters be run on vanilla-llama to use less than 100GB of VRAM?

megacewl t1_jbzts4h wrote on March 13, 2023 at 12:11 AM

I think so? Converting the model to 8-bit or 4-bit means each weight is stored in fewer bits, so the model is literally smaller (and, ironically, this barely changes the output quality at all), which is why it requires less VRAM to load.
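As a rough back-of-the-envelope check, here is a small Python sketch of the weight-size arithmetic. It assumes a nominal 65e9 parameters and counts the weights only; activations, the attention cache, and any per-block scale factors a quantization format stores are ignored, so real usage will be somewhat higher. The function name `weight_memory_gb` is made up for illustration and is not part of vanilla-llama.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "65B" taken as a nominal 65e9 parameters.
N_PARAMS_65B = 65e9

for bits in (16, 8, 4):
    size = weight_memory_gb(N_PARAMS_65B, bits)
    print(f"{bits:>2}-bit weights: ~{size:.1f} GB")
```

By this estimate the 65B weights come to roughly 130 GB at 16-bit, 65 GB at 8-bit, and 32.5 GB at 4-bit, so both quantized versions would land under 100GB if the loader supports them.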

There are tutorials for setting up llama.cpp with 4-bit converted LLaMA models, which may help you achieve your goal. llama.cpp is an implementation of LLaMA in C++ that runs on the CPU and system RAM. Someone got the 7B model running with it on a 4GB Raspberry Pi 4, so it may be worth checking out if you're low on VRAM.