Amazing_Painter_7692 (OP) wrote on March 12, 2023 at 11:34 PM, replying to stefanof93:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

Performance is quite good.
Amazing_Painter_7692 (OP) wrote on March 12, 2023 at 11:33 PM, replying to remghoost7:

There's an inference engine class if you want to build out your own API:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/engine.py#L56-L96

And there's a simple text inference script here:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/llama_inference.py

Or in the original repo:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

BUT someone has already made a webUI like the automatic1111 one!

https://github.com/oobabooga/text-generation-webui

Unfortunately it looked really complicated for me to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P
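For context, a minimal sketch of what such an engine/inference script boils down to, written against the standard HuggingFace transformers API rather than the repo's actual engine.py (the class and method names below are illustrative, and this loads plain fp16 weights, not the 4-bit GPTQ path):

```python
# Rough sketch of an "inference engine" wrapper. NOT the bot's actual
# engine.py interface; SimpleEngine is an illustrative name only.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer


class SimpleEngine:
    def __init__(self, model_name: str):
        self.tokenizer = LlamaTokenizer.from_pretrained(model_name)
        self.model = LlamaForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,  # roughly halves memory vs fp32
            device_map="auto",          # needs `accelerate` installed
        )

    def complete(self, prompt: str, max_new_tokens: int = 128) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            out = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Return only the newly generated tokens, not the echoed prompt.
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    engine = SimpleEngine("decapoda-research/llama-7b-hf")  # fp16, not 4-bit
    print(engine.complete("Q: What does 4-bit quantization do?\nA:"))
```

Wrapping generation in a small class like this is what makes it easy to drop the same model behind a Discord handler, a CLI script, or a web UI.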
Amazing_Painter_7692 (OP) wrote on March 12, 2023 at 9:55 PM, replying to remghoost7:

Should work fine with the 7b param model:

https://huggingface.co/decapoda-research/llama-7b-hf-int4
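The rough memory arithmetic behind that: at 4 bits per parameter the weights alone come to only a few GiB, before activations and the KV cache (a back-of-the-envelope estimate, not a measurement):

```python
# Back-of-the-envelope estimate of weight memory at 4 bits per parameter.
# Real usage is higher: activations, KV cache, and CUDA overhead come on top.
def weight_gib(n_params: float, bits_per_param: float = 4) -> float:
    return n_params * bits_per_param / 8 / 2**30

print(f"7b  @ 4-bit: ~{weight_gib(7e9):.1f} GiB")   # ~3.3 GiB
print(f"13b @ 4-bit: ~{weight_gib(13e9):.1f} GiB")  # ~6.1 GiB
```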
Amazing_Painter_7692 (OP) wrote on March 12, 2023 at 9:28 PM, replying to 3deal:

It's the HuggingFace transformers module version of the weights from Meta/Facebook Research:

https://github.com/huggingface/transformers/pull/21955
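A minimal sketch of loading that converted checkpoint with the classes the PR introduced (LlamaTokenizer / LlamaForCausalLM), assuming a transformers release that includes the PR; this is the unquantized fp16 model, not the GPTQ 4-bit one:

```python
# Hedged sketch: load the HF-format LLaMA weights using the classes added in
# transformers PR #21955. Plain fp16 weights, not the 4-bit GPTQ checkpoint.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

name = "decapoda-research/llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(name)
model = LlamaForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {n_params / 1e9:.1f}B parameters in {model.dtype}")
```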
[P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM (link post: github.com)

Submitted by Amazing_Painter_7692 on March 12, 2023 at 7:13 PM in MachineLearning · 51 comments