Submitted by bo_peng t3_10eh2f3 in MachineLearning

Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens), and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint.

https://preview.redd.it/7ycdftmjvmca1.png?width=1174&format=png&auto=webp&v=enabled&s=1622fb8cd7deb5ccd1934c4cc1d66ce696e81f20

RWKV is an RNN that also works as a linear transformer (or, if you prefer, a linear transformer that also works as an RNN). So it has both a parallel and a serial mode, and you get the best of both worlds: fast parallel training and VRAM-saving serial inference.

At this moment, RWKV might be the only pure RNN that scales like a standard transformer for language modeling, without using any QKV attention. It's also great at preserving long context (unlike an LSTM).
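
For intuition, here is a minimal sketch of what the serial (RNN) mode of an attention-free time-mixing layer can look like. This is my own simplification, not the actual RWKV-LM code (real RWKV adds extra terms such as a bonus for the current token and takes care of numerical stability), but it shows the key point: each token only updates a fixed-size state, so inference memory does not grow with context length.

```python
import torch

def rnn_step(k_t, v_t, state, w):
    """One serial-mode step of a simplified attention-free time mixing.

    k_t, v_t : (d,) key/value projections of the current token
    state    : (num, den) fixed-size running sums over past tokens
    w        : (d,) per-channel decay rate (assumed positive)
    """
    num, den = state
    # Output blends the accumulated past with the current token.
    out = (num + torch.exp(k_t) * v_t) / (den + torch.exp(k_t))
    # Decay the past and fold in the current token for the next step.
    # (Real implementations rescale here to keep the exponentials stable.)
    decay = torch.exp(-w)
    num = decay * num + torch.exp(k_t) * v_t
    den = decay * den + torch.exp(k_t)
    return out, (num, den)

d = 8
w = torch.rand(d)                                  # toy decay parameters
state = (torch.zeros(d), torch.zeros(d))
for t in range(16):                                # stream tokens one by one
    k_t, v_t = torch.randn(d), torch.randn(d)
    out, state = rnn_step(k_t, v_t, state, w)
# The state stays O(d) no matter how many tokens have been processed.
```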

Moreover, you get a smooth, spike-free, carefree training experience (bf16 & Adam):

https://preview.redd.it/0g3lrg6mvmca1.png?width=871&format=png&auto=webp&v=enabled&s=76a4b7a4859ec589f19552f8248ccc44f87a8a1d
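
In case the "(bf16 & Adam)" note is unfamiliar: below is a generic, minimal sketch of bf16-autocast training with Adam in PyTorch, using a toy stand-in model. It is my own illustration, not the actual RWKV-LM training loop or its hyperparameters.

```python
import torch
import torch.nn as nn

# Toy stand-in model; the real RWKV-LM model and data pipeline are much larger.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10):                    # requires a CUDA GPU with bf16 support
    x = torch.randint(0, 1000, (4, 128), device="cuda")   # fake token batch
    # bf16 autocast runs the matmuls in bfloat16; unlike fp16, bf16 has the
    # same exponent range as fp32, so no loss scaling is needed.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                  # (4, 128, 1000)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, 1000), x.reshape(-1)
        )
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
```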

As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and there are few conversations in the Pile, so don't expect great quality. But it's already fun. Chat examples (using slightly earlier checkpoints):

https://preview.redd.it/zyqni6bpvmca1.png?width=1084&format=png&auto=webp&v=enabled&s=dd34763778a68d70f4079fe391197b07a885f2e5

https://preview.redd.it/xhje4j7qvmca1.png?width=1200&format=png&auto=webp&v=enabled&s=4622ff3c5538cb16b0801d3215f747b64f083623

And you can chat with the bot (or try free generation) in the RWKV Discord (link in the GitHub readme: https://github.com/BlinkDL/RWKV-LM ). This is an open-source project; let's build it together.

110

Comments

blabboy t1_j4ujuqj wrote

Amazing work, I've been following this for a while. Have you considered putting this into an arXiv whitepaper describing the model + tricks? I've wanted to cite this a couple of times, but have had to resort to citing the GitHub repo.

15

currentscurrents t1_j4rcc3e wrote

Interesting! I haven't heard of RWKV before.

Getting rid of attention seems like a good way to increase training speed (since training all those attention heads at once is slow), but how can it work so well without attention?

Also, aren't RNNs usually slower than transformers because they can't be parallelized?

10

bo_peng OP t1_j4rht4i wrote

RWKV is an RNN that also works as a linear transformer (or, if you prefer, a linear transformer that also works as an RNN). So it has both a parallel and a serial mode, and you get the best of both worlds: fast parallel training and VRAM-saving serial inference.

Almost all such "linear transformers" are bad at language modeling, but RWKV is the exception. The basic idea is a bit similar to AFT ( https://arxiv.org/abs/2105.14103 ), and then I added lots of new ideas :)
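
To connect this to the serial sketch earlier in the post: below is an equally simplified parallel-mode version in the spirit of AFT, again my own illustration rather than the exact RWKV formula. All positions are computed at once from a causal decay pattern, which is why training can be parallelized across the sequence like a transformer, while the serial recurrence gives cheap, constant-memory inference.

```python
import torch

def parallel_mode(k, v, w):
    """Simplified AFT-style attention-free mixing for a whole sequence at once.

    k, v : (T, d) key/value projections for all T tokens
    w    : (d,)  per-channel decay rate (assumed positive)
    Equivalent to unrolling the serial sketch above token by token.
    """
    T, d = k.shape
    t = torch.arange(T)
    # Token j contributes to token i with weight exp(-(i - j - 1) * w + k_j)
    # for j < i, and exp(k_i) for j == i; future tokens are masked out.
    delta = (t[:, None] - t[None, :] - 1).clamp(min=0).float()   # (T, T)
    causal = (t[:, None] >= t[None, :]).float()                   # (T, T)
    weights = torch.exp(-delta[:, :, None] * w + k[None, :, :])   # (T, T, d)
    weights = weights * causal[:, :, None]
    num = (weights * v[None, :, :]).sum(dim=1)                    # (T, d)
    den = weights.sum(dim=1)                                      # (T, d)
    return num / den

T, d = 16, 8
out = parallel_mode(torch.randn(T, d), torch.randn(T, d), torch.rand(d))
```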

12

mrconter1 t1_j4wq1zs wrote

How does the memory scale with the context window size?

1

femboyxx98 t1_j4vlsfj wrote

Have you compared it against modern transformer implementations, e.g. with FlashAttention, which can provide a 3x-5x speed-up by itself?

5

blimpyway t1_j4ulemc wrote

Prior to this, did you experiment with smaller (i.e., more manageable) variants of this model, or were previous variants attempted directly at this scale?

3

Taenk t1_j4zcu0e wrote

Do I understand correctly that I could run this model at home on a graphics card with 8GB VRAM?

2

chip_0 t1_j5naknl wrote

Have you used RL with Human Feedback to fine-tune it yet?

I have an idea about how to use RLHF without expensive human annotation. Let me know if you would like to collaborate on that!

1

Gody_Godee t1_j5zl5ar wrote

another underperforming linear transformer again? ¯\_(ツ)_/¯

0

timelyparadox t1_j4uig6f wrote

It really wants you to make a chatbot; I think it is self-aware and biased

−2