luaks1337 t1_jc320gp wrote
With 4-bit quantization you could run something comparable to text-davinci-003 on a Raspberry Pi or a smartphone. What a time to be alive.
Disastrous_Elk_6375 t1_jc3e9ao wrote
With 8-bit this should fit on a 3060 12GB, which is pretty affordable right now. If this works as well as they claim, it's going to be amazing.
atlast_a_redditor t1_jc3jzcf wrote
I know nothing about this stuff, but I'd rather have the 4-bit 13B model for my 3060 12GB. From what I've read, quantisation has less of an effect on larger models.
disgruntled_pie t1_jc4ffo1 wrote
I’ve successfully run the 13B parameter version of Llama on my 2080TI (11GB of VRAM) in 4-bit mode and performance was pretty good.
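Rough back-of-the-envelope for why that fits (just an illustration I put together, not an official figure; it assumes the weights dominate and ignores activations, the KV cache and framework overhead, and the helper name is mine):

```python
# Back-of-the-envelope weight-memory estimate for quantized models.
# Real usage is higher: activations, KV cache and framework overhead aren't counted.
def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for name, params in [("7B", 7), ("13B", 13)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: ~{weight_gib(params, bits):.1f} GiB")
```

That puts 13B at 4-bit around 6 GiB of weights, which is why it squeezes onto 11-12 GB cards with room left over for everything else.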
pilibitti t1_jc56vv5 wrote
hey do you have a link for how one might set this up?
disgruntled_pie t1_jc5g6or wrote
I’m using this project: https://github.com/oobabooga/text-generation-webui
The project's GitHub wiki has a page on LLaMA that explains everything you need.
pdaddyo t1_jc5uoly wrote
And if you get stuck check out /r/oobabooga
sneakpeekbot t1_jc5upgp wrote
Here's a sneak peek of /r/Oobabooga using the top posts of all time!
#1: The new streaming algorithm has been merged. It's a lot faster! | 6 comments
#2: Text streaming will become 1000000x faster tomorrow
#3: LLaMA tutorial (including 4-bit mode) | 10 comments
pilibitti t1_jc5was5 wrote
thank you!
Maximus-CZ t1_jc3pspo wrote
Holding onto my papers!
sweatierorc t1_jc3ruox wrote
Squeeze that paper
luaks1337 t1_jc5xarn wrote
I hope he makes a video about it!
FaceDeer t1_jc3k2oi wrote
I'm curious, there must be a downside to reducing the bits, mustn't there? What does intensively jpegging an AI's brain do to it? Is this why Lt. Commander Data couldn't use contractions?
luaks1337 t1_jc3p8oq wrote
Backpropagation needs a lot of numerical precision, so training is done in 16- or 32-bit. Post-training quantization, however, seems to have very little impact on the results. There are different ways to quantize, and apparently llama.cpp uses the most basic one, yet it still works like a charm. Georgi Gerganov (the maintainer) wrote a tweet about it, but I can't find it right now.
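To illustrate what the "most basic way" roughly looks like (a minimal round-to-nearest sketch in NumPy, not llama.cpp's actual code; the block size of 32 and the [-7, 7] integer range are my assumptions):

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Symmetric round-to-nearest 4-bit quantization: one float scale per block,
    signed integer codes in [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_q4(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

# Round-trip error on some fake "weights"
w = np.random.randn(4096).astype(np.float32) * 0.02
codes, scales = quantize_q4(w)
print("mean abs error:", np.abs(w - dequantize_q4(codes, scales)).mean())
```

As far as I understand, the real formats also pack two 4-bit codes per byte (and some variants store a per-block offset too), but the per-block scale plus round-to-nearest step above is the basic idea.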