Select_Beautiful8 t1_jbr867y wrote on March 11, 2023 at 2:09 AM

Oh it loaded, it was because I wrote "cuda fp32" instead of "cpu fp32" in the second half of the argument. Thanks

KerfuffleV2 t1_jbr95r5 wrote on March 11, 2023 at 2:17 AM

No problem. fp16i8 uses about half the memory of fp16, so the "cuda fp32" you had would not only use about 4x as much memory as fp16i8, it would also try to put everything on the GPU!
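To make the memory ratios concrete, here is a rough sketch of the arithmetic. It assumes 4 bytes per weight for fp32, 2 for fp16, and about 1 for fp16i8 (consistent with the "about half of fp16" figure above); the 7-billion parameter count is a hypothetical example, and real usage will be somewhat higher because of activations and quantization overhead.

```python
# Approximate bytes per weight for each precision named in the thread.
# Assumption: fp16i8 is counted as 1 byte per weight; in practice it needs
# a little more because quantized weights carry extra scaling data.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "fp16i8": 1}


def weights_gib(n_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 2**30


N_PARAMS = 7e9  # hypothetical model size, not taken from the thread

for dtype in ("fp16i8", "fp16", "fp32"):
    print(f"{dtype:>7}: {weights_gib(N_PARAMS, dtype):.1f} GiB")
```

Whatever the model size, the fp32 figure comes out at four times the fp16i8 figure, which is the 4x mentioned above.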

Select_Beautiful8 t1_jbra2af wrote on March 11, 2023 at 2:25 AM

OK, so "cuda fp16i8 *16 -> cpu fp32" would be the best argument for me?

KerfuffleV2 t1_jbrb0qa wrote on March 11, 2023 at 2:33 AM

I'm definitely not qualified to answer a question like that. I'm just a person who managed to get it working on a GPU with 6G of VRAM. Basically, as far as I understand, the more you can run on the GPU, the better. So it really depends on what other stuff you have using your GPU's memory.

Like I mentioned, when I got it working I already had about 1.25G used by other applications and my desktop environment. From my calculations, it should be possible to fit 21, maybe 22 layers onto the GPU as long as nothing else is using it (so basically, you'd have to be in text mode with no desktop environment running).

If you're using Linux and an Nvidia card, you can try installing an application called nvtop, which can show things like VRAM usage. The way to install it will be specific to your distribution, so I can't help you with that. If you're using Windows or a different OS, I can't really help you either.
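As an alternative to nvtop, the `nvidia-smi` command-line tool that normally ships with the Nvidia driver can report VRAM usage in a script-friendly form. The sketch below is only an illustration: the helper names are made up for this example, the sample text stands in for real output, and it assumes `nvidia-smi` is on the PATH when `query_vram()` is actually called.

```python
import subprocess

# Ask nvidia-smi for used/total memory as bare numbers (MiB), one GPU per line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(csv_text):
    """Turn lines like '1280, 6144' into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_vram():
    """Run nvidia-smi and parse its output (needs an Nvidia driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)


# Demonstration on a made-up sample instead of a live GPU.
sample = "1280, 6144\n"
for used, total in parse_vram(sample):
    print(f"used {used} MiB, free {total - used} MiB of {total} MiB")
```

On a machine with an Nvidia GPU, calling `query_vram()` before and after loading the model gives the two numbers needed to work out how much VRAM the loaded layers take.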

But anyway, if you can find how much VRAM you have free, you can look at how much of that loading 16 layers uses and calculate how many more you can add before you run out.
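That calculation can be sketched as follows. It assumes every layer costs roughly the same amount of VRAM, which is a simplification, and all numbers in the example call are hypothetical. The `reserve_mib` safety margin is also an assumption added for this sketch.

```python
import math


def extra_layers_that_fit(free_mib, used_before_mib, used_after_mib,
                          layers_loaded, reserve_mib=256):
    """Estimate how many more layers fit in the VRAM that is still free.

    free_mib:        VRAM free *after* loading `layers_loaded` layers
    used_before_mib: VRAM in use before the model was loaded
    used_after_mib:  VRAM in use after the model was loaded
    reserve_mib:     headroom to leave unused (assumed safety margin)
    """
    per_layer_mib = (used_after_mib - used_before_mib) / layers_loaded
    budget_mib = free_mib - reserve_mib
    return max(0, math.floor(budget_mib / per_layer_mib))


# Hypothetical readings: 16 layers raised usage from 1280 MiB to 4480 MiB,
# leaving 1256 MiB free.
print(extra_layers_that_fit(free_mib=1256, used_before_mib=1280,
                            used_after_mib=4480, layers_loaded=16))
```

The result is only a starting point; it is safer to add layers one or two at a time and watch for out-of-memory errors.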

That's still not necessarily going to be optimal though. I don't know how things like the difference in speed or precision between fp16 and fp16i8 work. It's not impossible that there's some other combination of parameters that would be better in some way than just trying to fit as much as possible onto the GPU in fp16i8 format. You'd have to ask someone more knowledgeable for a real answer.

Select_Beautiful8 t1_jbrbor0 wrote on March 11, 2023 at 2:38 AM

Thanks, I use Windows, but I want to do a dual boot

KerfuffleV2 t1_jbz7yfk wrote on March 12, 2023 at 9:32 PM

I've been playing with this for a bit and I actually haven't found any case where fp16i8 worked better than halving the layers and using fp16.

If you haven't already tried it, give something like "cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1" a try and see what happens. It's around twice as fast as "cuda fp16i8 *16 -> cpu fp32" for me, which is surprising.

That one will use 7 fp16 layers on the GPU, and stream all the rest except the very last as fp16 on the GPU also. The 33rd layer gets run on the CPU. Not sure if that last part makes a big difference.
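To illustrate how that strategy string breaks down, here is a simplified sketch of a parser. It is not the rwkv package's own parsing code; it only mirrors the behaviour described in this thread, and the total of 33 layers is taken from the "33rd layer" remark above. The `Stage`, `parse_strategy`, and `allocate` names are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage:
    device: str
    dtype: str
    count: Optional[int]  # explicit "*N" layer count, or None if omitted
    stream: bool          # True when the count ends with "+"


def parse_strategy(strategy):
    """Split 'dev dtype *N -> dev dtype *N+ -> ...' into Stage objects."""
    stages = []
    for part in strategy.split("->"):
        tokens = part.split()
        device, dtype = tokens[0], tokens[1]
        count, stream = None, False
        if len(tokens) > 2:
            spec = tokens[2].lstrip("*")
            stream = spec.endswith("+")
            count = int(spec.rstrip("+"))
        stages.append(Stage(device, dtype, count, stream))
    return stages


def allocate(stages, total_layers):
    """Return (device, dtype, resident, streamed) per stage.

    Layers without an explicit home go to the streaming stage if there is
    one, otherwise to the stage that has no count.
    """
    leftover = max(0, total_layers - sum(s.count or 0 for s in stages))
    plan = []
    for s in stages:
        resident, streamed = s.count or 0, 0
        if s.stream:
            streamed, leftover = leftover, 0
        elif s.count is None:
            resident, leftover = leftover, 0
        plan.append((s.device, s.dtype, resident, streamed))
    return plan


for row in allocate(parse_strategy("cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1"), 33):
    print(row)
```

Under these assumptions the plan is 7 resident fp16 layers on the GPU, 25 layers streamed through the GPU, and 1 layer on the CPU in fp32, matching the description above.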

Select_Beautiful8 t1_jc0w1px wrote on March 13, 2023 at 5:45 AM

This gave me the "out of memory" error again, which did not happen with "cuda fp16i8 *16 -> cpu fp32" :(

KerfuffleV2 t1_jc18f6a wrote on March 13, 2023 at 8:36 AM

Huh, that's weird. You can try reducing the first one from 7 to 6 or maybe even 5:

cuda fp16 *6 -> cuda fp16 *0+ -> cpu fp32 *1

Also, be sure to double check for typos. :) Any incorrect numbers/punctuation will probably cause problems. Especially the "+" in the second part.

Select_Beautiful8 t1_jc9lckr wrote on March 15, 2023 at 7:31 AM

Just got time to try it, but it neither loads nor gives an error message :( Thanks anyway for your help!

[P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K

KerfuffleV2 t1_jbr6r2f wrote on March 11, 2023 at 1:58 AM