Submitted by bo_peng t3_11f9k5g in MachineLearning
KerfuffleV2 t1_jbz7yfk wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
I've been playing with this for a bit, and I actually haven't found any case where fp16i8 worked better than halving the layers and using fp16.
If you haven't already tried it, give something like cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1 a try and see what happens. It's around twice as fast as cuda fp16i8 *16 -> cpu fp32 for me, which is surprising.
That one will use 7 fp16 layers on the GPU, and stream all the rest except the very last as fp16 on the GPU as well. The 33rd layer gets run on the CPU. Not sure if that last part makes a big difference.
Select_Beautiful8 t1_jc0w1px wrote
This gave me the "out of memory" error again, which did not happen with the "cuda fp16i8 *16 -> cpu fp32" :(
KerfuffleV2 t1_jc18f6a wrote
Huh, that's weird. You can try reducing the first one from 7 to 6 or maybe even 5:
cuda fp16 *6 -> cuda fp16 *0+ -> cpu fp32 *1
Also, be sure to double-check for typos. :) Any incorrect numbers/punctuation will probably cause problems, especially the "+" in the second part.
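If you'd rather not guess at the layer count, something like this (a hypothetical helper, not part of ChatRWKV itself) would step down the number of resident layers until the model loads without running out of VRAM:

```python
# Hypothetical helper: try progressively fewer resident fp16 layers
# until loading succeeds without exhausting VRAM.
import torch
from rwkv.model import RWKV

MODEL_PATH = "RWKV-4-Pile-14B-20230213-8019"  # placeholder checkpoint path

model = None
for n in (7, 6, 5):
    strategy = f"cuda fp16 *{n} -> cuda fp16 *0+ -> cpu fp32 *1"
    try:
        model = RWKV(model=MODEL_PATH, strategy=strategy)
        print(f"loaded with: {strategy}")
        break
    except torch.cuda.OutOfMemoryError:
        print(f"OOM with {n} resident layers, trying fewer...")
        torch.cuda.empty_cache()  # release the partial allocation before retrying
```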
Select_Beautiful8 t1_jc9lckr wrote
I just got time to try it, but it doesn't load, nor does it give an error message :( Thanks anyway for your help!