luaks1337 t1_jc320gp wrote
With 4-bit quantization you could run something comparable to text-davinci-003 on a Raspberry Pi or a smartphone. What a time to be alive.
Disastrous_Elk_6375 t1_jc3e9ao wrote
With 8-bit this should fit on a 3060 12GB, which is pretty affordable right now. If this works as well as they claim, it's going to be amazing.
atlast_a_redditor t1_jc3jzcf wrote
I know nothing about this stuff, but I'd rather have the 4-bit 13B model for my 3060 12GB. From what I've read, quantisation has less of an effect on larger models.
disgruntled_pie t1_jc4ffo1 wrote
I’ve successfully run the 13B parameter version of Llama on my 2080TI (11GB of VRAM) in 4-bit mode and performance was pretty good.
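Rough back-of-the-envelope for why that fits (just an illustration I put together, not an official figure; it assumes the weights dominate and ignores activations, the KV cache and framework overhead, and the helper name is mine):

```python
# Back-of-the-envelope weight-memory estimate for quantized models.
# Real usage is higher: activations, KV cache and framework overhead aren't counted.
def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for name, params in [("7B", 7), ("13B", 13)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: ~{weight_gib(params, bits):.1f} GiB")
```

That puts 13B at 4-bit around 6 GiB of weights, which is why it squeezes onto 11-12 GB cards with room left over for everything else.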
pilibitti t1_jc56vv5 wrote
hey do you have a link for how one might set this up?
disgruntled_pie t1_jc5g6or wrote
I’m using this project: https://github.com/oobabooga/text-generation-webui
The project's GitHub wiki has a page on LLaMA that explains everything you need.
pdaddyo t1_jc5uoly wrote
And if you get stuck check out /r/oobabooga
sneakpeekbot t1_jc5upgp wrote
Here's a sneak peek of /r/Oobabooga using the top posts of all time!
#1: The new streaming algorithm has been merged. It's a lot faster! | 6 comments
#2: Text streaming will become 1000000x faster tomorrow
#3: LLaMA tutorial (including 4-bit mode) | 10 comments
pilibitti t1_jc5was5 wrote
thank you!
Maximus-CZ t1_jc3pspo wrote
Holding onto my papers!
sweatierorc t1_jc3ruox wrote
Squeeze that paper
luaks1337 t1_jc5xarn wrote
I hope he makes a video about it!
FaceDeer t1_jc3k2oi wrote
I'm curious, there must be a downside to reducing the bits, mustn't there? What does intensively jpegging an AI's brain do to it? Is this why Lt. Commander Data couldn't use contractions?
luaks1337 t1_jc3p8oq wrote
Backpropagation needs a lot of numerical precision, so training is done in 16- or 32-bit. Post-training quantization, however, seems to have very little impact on the results. There are different ways to quantize, and apparently llama.cpp uses the most basic one, yet it still works like a charm. Georgi Gerganov (the maintainer) wrote a tweet about it, but I can't find it right now.
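To illustrate what the "most basic way" roughly looks like (a minimal round-to-nearest sketch in NumPy, not llama.cpp's actual code; the block size of 32 and the [-7, 7] integer range are my assumptions):

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Symmetric round-to-nearest 4-bit quantization: one float scale per block,
    signed integer codes in [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_q4(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

# Round-trip error on some fake "weights"
w = np.random.randn(4096).astype(np.float32) * 0.02
codes, scales = quantize_q4(w)
print("mean abs error:", np.abs(w - dequantize_q4(codes, scales)).mean())
```

As far as I understand, the real formats also pack two 4-bit codes per byte (and some variants store a per-block offset too), but the per-block scale plus round-to-nearest step above is the basic idea.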