Submitted by head_robotics t3_1172jrs in MachineLearning
I've been looking into open source large language models to run locally on my machine.
Seems GPT-J and GPT-Neo are out of reach for me because of RAM / VRAM requirements.
What models would be doable with the following hardware?
CPU: AMD Ryzen 7 3700X 8-Core, 3600 MHz
RAM: 32 GB
GPUs:
- NVIDIA GeForce RTX 2070 8GB VRAM
- NVIDIA Tesla M40 24GB VRAM
Disastrous_Elk_6375 t1_j99ry6s wrote
GPT-NeoX should fit in 24GB of VRAM with 8-bit quantization, for inference.
I managed to run GPT-J 6B on a 3060 w/ 12GB and it takes about 7.2GB of VRAM.
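Those numbers are consistent with a back-of-the-envelope estimate: at 8-bit, each parameter takes one byte, so weights alone need roughly (params in billions) GB, plus some runtime overhead. A quick sketch of that arithmetic (the ~1 GB overhead figure is an assumption, not a measured value):

```python
# Rough VRAM estimate for 8-bit inference, assuming weights dominate memory.
def vram_gb(n_params_billion: float, bytes_per_param: float, overhead_gb: float = 1.0) -> float:
    """Approximate VRAM (GB) needed to load model weights plus fixed overhead."""
    return n_params_billion * bytes_per_param + overhead_gb

# GPT-J 6B in int8: ~6 GB of weights + ~1 GB overhead ≈ 7 GB,
# in the ballpark of the ~7.2 GB observed on a 12 GB 3060.
print(vram_gb(6, 1))   # 7.0

# GPT-NeoX 20B in int8: ~21 GB, which is why it should just fit on the 24 GB M40.
print(vram_gb(20, 1))  # 21.0
```

This ignores activation memory and KV cache, which grow with context length, so treat it as a lower bound.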