Submitted by bo_peng t3_11f9k5g in MachineLearning
Hi everyone. ChatRWKV v2 can now split RWKV across multiple GPUs, or stream layers (compute layer by layer), so you can run RWKV 14B with as little as 3 GB of VRAM. https://github.com/BlinkDL/ChatRWKV
Examples:

    'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32'

= first 10 layers on cuda:0 in fp16, then 8 layers on cuda:1 in fp16, then the remaining layers on cpu in fp32

    'cuda fp16 *20+'

= first 20 layers on cuda in fp16, then stream the remaining layers through the same GPU
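For example, a minimal sketch of loading the 14B model with the streaming strategy (the file path is a placeholder; download the weights from https://huggingface.co/BlinkDL):

    from rwkv.model import RWKV

    # First 20 layers stay resident on the GPU in fp16; the trailing '+'
    # streams the remaining layers through the GPU one at a time,
    # trading speed for a much smaller VRAM footprint.
    model = RWKV(model='/path/to/RWKV-4-Pile-14B', strategy='cuda fp16 *20+')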
And RWKV is now a pip package: https://pypi.org/project/rwkv/
    import os
    os.environ['RWKV_JIT_ON'] = '1'
    os.environ["RWKV_CUDA_ON"] = '0' # if '1' then compile CUDA kernel for seq mode (much faster)

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE, PIPELINE_ARGS

    # download models: https://huggingface.co/BlinkDL
    model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
    pipeline = PIPELINE(model, "20B_tokenizer.json") # find it in https://github.com/BlinkDL/ChatRWKV

    ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
    print(ctx, end='')

    def my_print(s):
        print(s, end='', flush=True)

    # For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
    # https://platform.openai.com/docs/api-reference/parameter-details
    args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7,
                         alpha_frequency = 0.25,
                         alpha_presence = 0.25,
                         token_ban = [0],  # ban the generation of some tokens
                         token_stop = [])  # stop generation whenever you see any token here

    pipeline.generate(ctx, token_count=512, args=args, callback=my_print)
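For context, the frequency and presence penalties subtract from the logits of tokens that have already been generated, roughly as in this sketch (a hypothetical helper illustrating the mechanism from the OpenAI docs linked above, not part of the rwkv package):

    def apply_penalties(logits, occurrence, alpha_frequency=0.25, alpha_presence=0.25):
        # occurrence: dict mapping token id -> times generated so far
        for token_id, count in occurrence.items():
            # flat penalty for having appeared at all, plus a per-repeat penalty
            logits[token_id] -= alpha_presence + alpha_frequency * count
        return logits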
Right now all RWKV models are still trained with a GPT-style (parallel) method, so they are limited by the ctxlen used in training, even though in theory they should have an almost unlimited ctxlen (because they are RNNs). However, RWKV models can easily be finetuned to support longer ctxlens (and the large models actually make use of the longer context). I have finetuned 1B5/3B/7B/14B to ctx4K, am now finetuning 7B/14B to ctx8K, and will finetune 14B to ctx16K after that :) All models are available at https://huggingface.co/BlinkDL
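Because RWKV is an RNN, you can also carry the recurrent state forward explicitly instead of re-feeding the whole context; a sketch using the pip package's forward interface (the path is a placeholder, and the token ids are arbitrary examples):

    from rwkv.model import RWKV

    model = RWKV(model='/path/to/RWKV-4-Pile-169M', strategy='cpu fp32')

    # Run a prompt once; the returned state summarizes everything seen so far.
    out, state = model.forward([187, 510, 1563, 310, 247], None)
    # Continue from that state with new tokens, without re-reading the prompt.
    out, state = model.forward([286, 326], state)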
The core of RWKV is still mostly a one-man project, but a number of great developers are building on top of it, and you are welcome to join our community :)
satireplusplus t1_jaiwxlo wrote
Wow, nice, I will try it out!
Btw: if you want to format the code in your post, you need to add 4 spaces in front of every line. Otherwise all newlines are lost. Lines starting with four spaces are treated as code.