Submitted by bo_peng t3_yxt8sa in MachineLearning
Hi everyone. I have finished training RWKV-4 7B (an attention-free RNN LLM) and it can match GPT-J (6B params) performance. Maybe RNN is already all you need :)
These are RWKV BF16 numbers. RWKV 3B beats GPT-Neo 2.7B on every benchmark (the smaller RWKV models only lag behind on LAMBADA). Note that GPT-J uses rotary positional embeddings and is therefore noticeably stronger than GPT-Neo, so I expect RWKV to surpass it when both are at 14B.
Previous discussion: https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/
RWKV has both an RNN mode and a GPT mode. The RNN mode is great for inference; the GPT mode is great for training. Both modes are faster than a usual transformer and save VRAM, because the self-attention mechanism is replaced by simpler (almost linear) formulas. Moreover, the hidden state in RNN mode is tiny, and you can use it as an embedding of the whole context. A rough sketch of the idea is below.
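For intuition, here is a minimal, self-contained sketch of an attention-free recurrent token mix with a constant-size state. This is not the official RWKV code: the weight names, the decay parameterization, and the omission of numerical stabilization are all simplifying assumptions; see the repo below for the real formulas.

```python
# Minimal sketch (assumptions noted above), in the spirit of RWKV's recurrent "WKV" update.
import numpy as np

def recurrent_mix_step(x, state, time_decay, time_first, kw, vw, rw):
    """One RNN-mode step: fold the current token into two running sums
    instead of attending over all previous tokens."""
    k = kw @ x                                   # key-like projection
    v = vw @ x                                   # value-like projection
    r = 1.0 / (1.0 + np.exp(-(rw @ x)))          # receptance (sigmoid gate)

    num, den = state                             # the whole hidden state: two vectors
    boost = np.exp(time_first + k)               # extra weight for the current token
    wkv = (num + boost * v) / (den + boost)      # weighted average of past and current values
    out = r * wkv                                # gated output for this position

    decay = np.exp(-np.exp(time_decay))          # per-channel exponential decay of the past
    new_state = (decay * num + np.exp(k) * v,
                 decay * den + np.exp(k))
    return out, new_state

if __name__ == "__main__":
    d = 8
    rng = np.random.default_rng(0)
    kw, vw, rw = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    time_decay, time_first = np.zeros(d), np.zeros(d)
    state = (np.zeros(d), np.zeros(d))           # constant size, regardless of context length
    for _ in range(16):                          # stream a "context" token by token
        x = rng.standard_normal(d)
        out, state = recurrent_mix_step(x, state, time_decay, time_first, kw, vw, rw)
    # `state` now summarizes everything read so far in O(d) memory,
    # which is why it can double as an embedding of the whole context.
```

In GPT mode the same quantities are computed for all positions of a sequence at once, which is what makes training parallelizable while inference stays a cheap per-token recurrence.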
Github: https://github.com/BlinkDL/RWKV-LM
Checkpt: https://huggingface.co/BlinkDL/rwkv-4-pile-7b
14B in progress (thanks to EleutherAI and Stability). Nice spike-free loss curves:
ChuckSeven t1_iwqey5x wrote
What is the size of the OPT model you are comparing with in that table?