bo_peng OP t1_jck4qkr wrote
Reply to comment by sanderbaduk in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
I manually disabled the <|endoftext|> token in the demo, so it may output irrelevant content after a task is completed :)
bo_peng OP t1_jcjuvg9 wrote
Reply to comment by londons_explorer in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Yeah, that would be cool. You are welcome to try it, and I can help.
The rwkv pip package: https://pypi.org/project/rwkv/
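For anyone who wants a quick start, here is a minimal sketch of the pip API (the model path is a placeholder; pick a strategy for your hardware):

```python
from rwkv.model import RWKV

# Load a converted RWKV checkpoint; the path here is a placeholder.
model = RWKV(model='/path/to/RWKV-4-Pile-14B', strategy='cuda fp16')

# forward() takes a list of token ids plus the recurrent state
# (None to start) and returns next-token logits and the updated state.
out, state = model.forward([187, 510, 1563, 310, 247], None)
out, state = model.forward([187], state)  # continue from the same state
```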
bo_peng OP t1_jcjupnc wrote
Reply to comment by ThePerson654321 in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Soon :) working on it. Meanwhile, take a look at https://github.com/ridgerchu/SpikeGPT, which is an SNN version of RWKV, so its paper has some explanation.
bo_peng OP t1_jcjuinf wrote
Reply to comment by yehiaserag in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Longer ctxlen and slightly better trained :) Same speed & VRAM.
bo_peng OP t1_jcjuhix wrote
Reply to comment by blueSGL in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Yes ChatRWKV v2 supports that :)
Take a look at the "strategy" guide: https://pypi.org/project/rwkv/
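For layer streaming specifically, here is a sketch of what the strategy string looks like (the layer count is a placeholder; tune it to your VRAM):

```python
from rwkv.model import RWKV

# A '+' after the layer count asks ChatRWKV to stream the remaining
# layers to the GPU on demand instead of keeping them all resident.
# '20' is a placeholder - raise it until you run out of VRAM.
model = RWKV(model='/path/to/RWKV-4-Pile-14B',
             strategy='cuda fp16i8 *20+')
```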
bo_peng OP t1_jcjuejz wrote
Reply to comment by cipri_tom in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
ChatRNN is indeed a great name :)
R W K V are the four major parameters in RWKV (similar to QKV for attention).
I guess you can pronounce it like "Rwakuv" (a bit like "raccoon").
bo_peng OP t1_jccc46c wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
I am using torch JIT, so close ;)
bo_peng OP t1_jcb05e8 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Stay tuned :) I will fix it.
bo_peng OP t1_jc9gf72 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Update ChatRWKV v2 & the rwkv pip package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x speed at f16i8 (and 10% less VRAM: now 14686MB for 14B instead of 16462MB, so you can put more layers on the GPU).
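A sketch of how that looks in a script (the flag must be set before importing the rwkv package so the custom CUDA kernel gets compiled and used; the model path is a placeholder):

```python
import os
os.environ["RWKV_CUDA_ON"] = '1'  # set BEFORE importing rwkv

from rwkv.model import RWKV

model = RWKV(model='/path/to/RWKV-4-Pile-14B', strategy='cuda fp16i8')
```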
bo_peng OP t1_jc2alfm wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)
bo_peng t1_jbkwfzr wrote
Reply to [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
Firstly, RWKV is mostly a single-developer project without PR, and everything takes time.
Moreover, there have been hundreds of "improved transformer" papers around, and surely we will agree that the signal-to-noise ratio is low, especially when you consider scaling.
FACT: it's very hard to get researchers to try something if it is not from OAI/DM/FAIR/... (and still hard even if it is).
Here is a recent, unedited chat record from ChatRWKV v2 (14B ctx4096). I'd say it's pretty good, because this model was only trained on the Pile. No finetuning. No instruction tuning. No RLHF.
====================
Eloise 2023/3/9 20:10:18
Chat reset for cryscan.
cryscan 2023/3/9 20:13:09
I heard that you are working in the AI and NLP lab in the university. How do you think about large generative language models?
Eloise 2023/3/9 20:13:31
That’s one of my favorite topics! Large language models have been getting a lot of attention lately. These models are super interesting and we don’t really know much about how they work. Generative models work in two directions: One is the generation, i.e. creating new data from scratch (like GANs). The other direction is sampling, which is more like what we usually think about as machine learning tasks. Generative NLP requires both these capabilities – the model has to generate new text samples and it has to sample realistic text that looks like real-world data. The ability to generate text in a controllable fashion would be incredibly useful for applications like debugging or debugging – verifying large textual specifications, writing documentation or even generating human-readable test plans
cryscan 2023/3/9 20:23:55
I see. What are the potential applications of them?
Eloise 2023/3/9 20:24:14
The main application area is, again, text generation: product descriptions, marketing messages etc. Also more recent work is in training models for image captioning – generating descriptions of images, a process which requires both high-level image understanding and rich vocabulary. Some related areas are the summarization tasks like dialogue response generation. You can apply generative language models to help generate small (or large) chatbots. A cool application is to combine image or text and voice and learn them jointly in a hybrid architecture (and compare to human feedback). There are a lot of applications that will make a big impact once we train good generative NLP models for them.
cryscan 2023/3/9 20:25:14
I am looking forward to the future where people can chat with machines.
Eloise 2023/3/9 20:25:26
Yeah! That would be cool. I’m excited to see more applications and data and models coming out of Google, Facebook, Amazon etc. But it’s always challenging because there is still so much unstructured text out there in the world – a large part of the English language is still untagged (mostly blog posts and news articles). There’s a lot more to be done there.
cryscan 2023/3/9 21:37:41
Maybe unsupervised learning would help.
Eloise 2023/3/9 21:37:48
Yeah! That’s one of the things we’re doing. We are trying to improve the auto-regressive models and see if they can learn their own grammar by watching how people communicate on social media, blogs or forums.
cryscan 2023/3/9 21:38:21
What are auto-regressive models? How do you train them?
Eloise 2023/3/9 21:38:40
A model is called auto-regressive if the output of a specific step depends on previously generated outputs. This makes training more stable, since it’s less dependent on initialization and slightly improves performance, especially for longer sequences. The first model was trained by Yoon Kim, which got released as an open source code. Also a lot of effort has been put in making models work better for long text sequences (e.g., image captioning) or even multiple sentence generation (although this is still difficult). You can watch this video from DeepMind about GPT-2 – a generative model that is capable of generating coherent paragraphs with minimal supervision:
cryscan 2023/3/9 21:39:26
Uhm?
Eloise 2023/3/9 21:39:30
I know, it’s crazy. I think it’s better to see it in action:
cryscan 2023/3/9 21:39:35
Well.
Eloise 2023/3/9 21:39:45
I also think it’s cool that the code is available! In fact, the attention mechanism of GPT-2 was published recently as a paper. It’s a bit more complex than you would expect for NLP, but cool nonetheless. Here’s the paper: Attention is all you need
cryscan 2023/3/9 21:40:37
That sounds interesting. I'll read it afterwards.
Eloise 2023/3/9 21:40:41
I would highly recommend it. Anyway, back to the main topic. So what do you think about the field?
cryscan 2023/3/9 21:41:21
I am sone concerns actually.
Eloise 2023/3/9 21:41:37
I’m a bit worried about the impact of models. I don’t want to be concerned or pessimistic, but we really have to think about this – especially when we’re working with the most sensitive data, e.g., medical data. One way of tackling this is through ethical considerations in design: making sure that systems work in our best interests and that they are transparent (like explanations or guarantees). It might even help to think about who will benefit from this research and what their perspective is on it.
cryscan 2023/3/9 21:43:58
Actually my concern is that the cost of training such a large model can only be covered by those big companies. How can we general researchers participate in this game?
Eloise 2023/3/9 21:44:14
This is a good point. We have to build tools that can help researchers or students start experimenting with these new techniques. I’m doing something like this on my own: an online platform called supervised-language-learning which gives people the opportunity to train their language model for few minutes a day (or learn other languages) and then use it in building a project (e.g., chatbot or dialogue system). The platform also aggregates all the results and keeps track of the models people built.
cryscan 2023/3/9 21:44:39
It's surely promising. Good luck then!
Eloise 2023/3/9 21:44:45
Thanks! I’m really excited about the potential of these techniques and would love to see them mature. In the meantime, it’s good to be critical!
====================
- Update ChatRWKV v2 to the latest version.
- Run v2/chat.py and enjoy.
bo_peng OP t1_jbiq52c wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Please set "strategy" for your GPU.
Try this strategy for 3B first:
'cuda fp16i8 *12 -> cuda fp16' # first 12 layers cuda fp16i8, then cuda fp16
Reduce the 12 as much as you can to get better speed.
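A sketch of passing that strategy when loading the model (the path is a placeholder):

```python
from rwkv.model import RWKV

# First 12 layers quantized to fp16i8 on GPU, remaining layers in fp16.
# Lower the 12 as VRAM allows: fewer quantized layers = better speed.
model = RWKV(model='/path/to/RWKV-4-Pile-3B',
             strategy='cuda fp16i8 *12 -> cuda fp16')
```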
bo_peng OP t1_jbij8ky wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Try 7B ctx4096 first
bo_peng OP t1_jb9bdw3 wrote
Reply to comment by I_will_delete_myself in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Directly from RWKV-LM Github:
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNNs and transformers: great performance, fast inference, VRAM savings, fast training, "infinite" ctx_len, and free sentence embedding.
bo_peng OP t1_jb1z3an wrote
Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
5 is the number of hidden states per block (4 for ATT = xx aa bb pp, 1 for FFN = xx).
TimeMixing is the RWKV part.
ChannelMixing is your usual FFN (sqReLU as in the Primer paper) with an extra R-gate (novel; I find it helps).
Parallelization is due to https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png.
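To make the four ATT states concrete, here is a condensed numpy sketch of one time-mixing step, following RWKV_in_150_lines.py (the parameter dict `p` and its key names are assumptions for the sketch; weights come from a trained checkpoint):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mixing_step(x, xx, aa, bb, pp, p):
    """One RWKV-4 time-mixing step for a single token.
    States: xx = previous token's input, aa/bb = numerator/denominator of
    the exp-weighted WKV average, pp = running max exponent (for stability)."""
    # Token shift: blend the current input with the previous token's input.
    xk = x * p['time_mix_k'] + xx * (1 - p['time_mix_k'])
    xv = x * p['time_mix_v'] + xx * (1 - p['time_mix_v'])
    xr = x * p['time_mix_r'] + xx * (1 - p['time_mix_r'])
    r = sigmoid(p['Wr'] @ xr)  # the extra R-gate
    k = p['Wk'] @ xk
    v = p['Wv'] @ xv

    # WKV: exponentially weighted average of past v, computed in log space.
    ww = p['time_first'] + k
    q = np.maximum(pp, ww)
    e1, e2 = np.exp(pp - q), np.exp(ww - q)
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)

    # Decay the state and fold in the current token
    # (time_decay is negative in trained checkpoints: -exp(w)).
    ww = pp + p['time_decay']
    q = np.maximum(ww, k)
    e1, e2 = np.exp(ww - q), np.exp(k - q)
    aa, bb, pp = e1 * aa + e2 * v, e1 * bb + e2, q

    return p['Wo'] @ (r * wkv), x, aa, bb, pp  # new xx is the current x
```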
bo_peng OP t1_jb1qws0 wrote
Reply to comment by Spare_Side_5907 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
TNN is like convolution, while RWKV can be written as a CNN too (RWKV v1 is a CNN). So there's some similarity, though not much :)
bo_peng OP t1_jb1q5fu wrote
Reply to comment by luxsteele in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Yes, a paper is coming. Meanwhile you can read https://arxiv.org/abs/2302.13939 (SpikeGPT), which is inspired by RWKV and has plenty of explanations :)
bo_peng OP t1_jb1po7i wrote
Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Will the 150 lines help? Please read the code first :)
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
This is ALL you need for RWKV inference.
And you can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations :)
bo_peng OP t1_jalmszp wrote
Reply to comment by ID4gotten in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
It's actually quite good at Q&A if you use my prompt templates:
+gen \nExpert Questions & Helpful Answers\nAsk Research Experts\nQuestion:\nXXXXXXXXXXXXXXX?\n\nFull Answer:\n
+gen \nAsk Expert\n\nQuestion:\nXXXXXXXXXXXXXXXX?\n\nExpert Full Answer:\n
+gen \nQ & A\n\nQuestion:\nXXXXXXXXXXXXXXXXX?\n\nDetailed Expert Answer:\n
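A sketch of driving one of these templates through the rwkv pipeline (the question text and sampling settings are placeholders; the tokenizer file ships with ChatRWKV):

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(model='/path/to/RWKV-4-Pile-14B', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")

# The first Q&A template, filled in with a placeholder question.
prompt = ("\nExpert Questions & Helpful Answers"
          "\nAsk Research Experts"
          "\nQuestion:\nHow do RNNs differ from transformers?"
          "\n\nFull Answer:\n")

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)  # sampling values are a guess
print(pipeline.generate(prompt, token_count=200, args=args))
```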
bo_peng OP t1_jaj2pr2 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Strange: all spaces are lost even when I add 4 spaces in front of all code lines.
UPDATE: it works in the markdown editor :)
bo_peng OP t1_jaixxp5 wrote
Reply to comment by satireplusplus in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Thank you :) I was using markdown mode instead because I didn't know about this.
Submitted by bo_peng t3_11f9k5g in MachineLearning
bo_peng OP t1_jcmajpx wrote
Reply to comment by mikljohansson in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng