Submitted by bo_peng t3_10eh2f3 in MachineLearning
Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens) and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint.
RWKV is an RNN that also works as a linear transformer (or you could say it's a linear transformer that also works as an RNN). So it has both a parallel mode and a serial mode, and you get the best of both worlds: fast parallel training, and RNN-style inference that saves VRAM.
At this moment, RWKV might be the only pure RNN that scales like standard transformers for language modeling, without using any QKV attention. And unlike LSTMs, it's great at preserving long context.
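For anyone curious how the "no QKV attention" part works, here's a rough NumPy sketch of the core idea. This is my simplification, not the actual code in the repo: the real model wraps this per-channel weighted average in token-shift, receptance gating, layernorm, etc., and the training kernel is custom CUDA with an exponent-rescaling trick for numerical stability. The point is just that the same quantity can be computed either as an RNN with a tiny state (serial mode) or independently per position over the whole sequence (parallel mode):

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Serial (RNN) mode: one token at a time, O(1) state per channel.

    k, v : (T, C) per-token keys and values
    w    : (C,) per-channel decay rate (positive; bigger = forget faster)
    u    : (C,) per-channel bonus given to the current token
    """
    T, C = k.shape
    num = np.zeros(C)             # decayed sum of exp(k_i) * v_i over the past
    den = np.zeros(C)             # decayed sum of exp(k_i) over the past
    out = np.empty((T, C))
    for t in range(T):
        # the current token gets the extra bonus u, then is averaged with the past
        cur = np.exp(u + k[t])
        out[t] = (num + cur * v[t]) / (den + cur)
        # push the current token into the state and decay everything
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

def wkv_parallel(k, v, w, u):
    """'Transformer' view of the same quantity: each position is a weighted
    average over all tokens up to it, with weights that decay with distance.
    Every position can be computed independently of the others (O(T^2) like
    attention, but with position-based weights instead of QK dot products)."""
    T, C = k.shape
    out = np.empty((T, C))
    for t in range(T):
        dist = (t - 1) - np.arange(t)                 # distance to each past token
        w_past = -dist[:, None] * w + k[:t]           # (t, C) log-weights for the past
        w_cur = (u + k[t])[None, :]                   # (1, C) log-weight for token t
        weights = np.exp(np.concatenate([w_past, w_cur], axis=0))
        vals = np.concatenate([v[:t], v[t:t + 1]], axis=0)
        out[t] = (weights * vals).sum(axis=0) / weights.sum(axis=0)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C = 16, 8
    k, v = rng.normal(size=(T, C)), rng.normal(size=(T, C))
    w, u = np.exp(rng.normal(size=C)), rng.normal(size=C)
    # both modes give the same output: train in parallel, infer as an RNN
    assert np.allclose(wkv_recurrent(k, v, w, u), wkv_parallel(k, v, w, u))
    print("parallel and recurrent modes match")
```

The recurrent form is what makes inference cheap (a small fixed-size state instead of a growing KV cache), and the parallel form is what lets training run efficiently on GPUs, like a transformer.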
Moreover, you get a smooth, spike-free, carefree training experience (bf16 & Adam).
As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and the Pile contains few conversations, so don't expect great quality. But it's already fun. The chat examples I've shared were generated with slightly earlier checkpoints.
You can also chat with the bot (or try free generation) on the RWKV Discord (link in the GitHub readme: https://github.com/BlinkDL/RWKV-LM ). This is an open-source project; let's build it together.
blabboy t1_j4ujuqj wrote
Amazing work, I've been following this for a while. Have you considered writing an arXiv paper describing the model and its tricks? I've wanted to cite this a couple of times, but have had to resort to citing the GitHub repo.