remghoost7 t1_jbzmfku wrote on March 12, 2023 at 11:15 PM

Reply to comment by Amazing_Painter_7692 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Super neat. Thanks for the reply. I'll try that.

Also, do you know if there's a local interface for it....?

I know it's not quite within the scope of the post, but it'd be neat to interact with it through a simple Python interface (or something like how Gradio is used for A1111's Stable Diffusion) rather than piping it all through Discord.
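For illustration, a "simple Python interface" could be as small as a terminal prompt loop. This is only a sketch under assumptions: `generate` is a hypothetical stand-in that echoes the prompt, not a function from the bot's repo, and you would swap in whatever call your model loader actually exposes.

```python
from typing import Callable


def generate(prompt: str) -> str:
    """Hypothetical placeholder for a real model call; it only echoes the prompt."""
    return f"(model output for: {prompt})"


def chat_loop(
    generate_fn: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> None:
    """Read prompts and write replies until the user types 'quit'/'exit' or sends EOF."""
    while True:
        try:
            prompt = read("> ")
        except EOFError:
            break
        text = prompt.strip()
        if text.lower() in {"quit", "exit"}:
            break
        if text:  # skip empty lines instead of sending them to the model
            write(generate_fn(text))


# To use it interactively, call: chat_loop(generate)
```

Passing `read` and `write` as parameters keeps the loop easy to test or to rewire to something other than the terminal later.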

Amazing_Painter_7692 OP t1_jbzoq05 wrote on March 12, 2023 at 11:33 PM

There's an inference engine class if you want to build out your own API:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/engine.py#L56-L96

And there's a simple text inference script here:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/llama_inference.py

Or in the original repo:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

BUT someone has already made a webUI like the automatic1111 one!

https://github.com/oobabooga/text-generation-webui

Unfortunately, it looked really complicated for me to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P

toothpastespiders t1_jc01mr9 wrote on March 13, 2023 at 1:12 AM

> BUT someone has already made a webUI like the automatic1111 one!

There's a subreddit for it over at /r/Oobabooga too that deserves more attention. I've only had a little time to play around with it but it's a pretty sleek system from what I've seen.

> it looked really complicated for me to set up with 4-bit weights

I'd like to say that the warnings make it more intimidating than it really is. I think it was just copying and pasting four or five lines into a terminal for me. Then again, I also couldn't get it to work, so I might be doing something wrong. I'm guessing it's just that my weirdo GPU wasn't really accounted for somewhere. I'm going to bang my head against it when I've got time, just because it's frustrating having tons of VRAM to spare and not getting the most out of it.

remghoost7 t1_jc0bymy wrote on March 13, 2023 at 2:34 AM

~~I'm having an issue with the C++ compiler on the last step.~~

~~I've been trying to use Python 3.10.9 though, so maybe that's my problem....? My venv is set up correctly as well.~~

~~Not specifically looking for help.~~

Apparently this person posted a guide on it in that subreddit. Will report back if I am successful.

Edit: Success! But I used WSL instead of native Windows (because that was a freaking headache). WSL worked the first time following the instructions on the GitHub page. Would highly recommend using WSL to install it instead of trying to force Windows to figure it out.

Pathos14489 t1_jc0dame wrote on March 13, 2023 at 2:45 AM

r/Oobabooga isn't accessible for me.

remghoost7 t1_jbzqf5m wrote on March 12, 2023 at 11:46 PM

Most excellent. Thank you so much! I will look into all of these.

Guess I know what I'm doing for the rest of the day. Time to make more coffee! haha.

You are my new favorite person this week.

Also, one final question, if you will. What's so unique about the 4-bit weights and why would you prefer to run it in that manner? Is it just VRAM optimization requirements....? I'm decently versed in Stable Diffusion, but LLMs are fairly new territory for me.

My question seems to have been answered here, and it is a VRAM limitation. Also, that last link seems to support 4-bit models as well. ~~Doesn't seem too bad to set up.... Though I installed A1111 when it first came out, so I learned through the garbage of that. Lol.~~ I was wrong. Oh so wrong. haha.
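A rough back-of-the-envelope check on the VRAM point, as a sketch only: it assumes a nominal 13 billion parameters for the 13b model and counts the weights alone, ignoring quantization metadata (scales, zero points), activations, and the attention cache, all of which add to the real footprint.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


n_params = 13e9  # assumption: nominal parameter count for the 13b model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under those assumptions, the 4-bit weights come to about 6 GiB versus about 24 GiB at 16-bit, which is in line with the thread title's <9 GiB figure once runtime overhead is added on top.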

Yet again, thank you for your time and have a wonderful rest of your day. <3
