currentscurrents t1_j4rcc3e wrote
Interesting! I haven't heard of RWKV before.
Getting rid of attention seems like a good way to increase training speed (since training all those attention heads at once is slow), but how can it work so well without attention?
Also, aren't RNNs usually slower than transformers because they can't be parallelized?
bo_peng OP t1_j4rht4i wrote
RWKV is an RNN that also works as a linear transformer (or, equivalently, a linear transformer that also works as an RNN). So it has both a parallel and a serial mode, and you get the best of both worlds (fast, and it saves VRAM).
Almost all such "linear transformers" are bad at language modeling, but RWKV is the exception. The basic idea is a bit similar to https://arxiv.org/abs/2105.14103. Then I added lots of new ideas :)
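To make the parallel-vs-serial point concrete, here is a minimal PyTorch sketch of a simplified AFT-style decayed weighted average (the kind of "linear attention" the linked paper describes), not the full RWKV recipe; the function names `serial_mode` and `parallel_mode` and the exact decay form are just illustrative assumptions. The point is that the same quantity can be computed step by step with a small recurrent state (cheap inference) or for all timesteps at once (parallel training).

```python
import torch

# Hypothetical sketch: each output is an average of past values, weighted by
# exp(k) and a per-channel exponential time decay. This is a simplified
# AFT-style illustration, NOT the full RWKV formulation (which adds a bonus
# term for the current token, token shift, gating, etc.).

def serial_mode(k, v, w):
    """RNN-style evaluation: O(1) state per step, good for inference.
    k, v: (T, C) key/value sequences; w: (C,) per-channel decay rate (> 0)."""
    T, C = k.shape
    num = torch.zeros(C)          # running, decayed sum of exp(k_i) * v_i
    den = torch.zeros(C)          # running, decayed sum of exp(k_i)
    out = torch.empty(T, C)
    decay = torch.exp(-w)
    for t in range(T):
        num = decay * num + torch.exp(k[t]) * v[t]
        den = decay * den + torch.exp(k[t])
        out[t] = num / den
    return out

def parallel_mode(k, v, w):
    """Transformer-style evaluation: materialize all pairwise decay weights,
    so every timestep is computed at once (good for training)."""
    T, C = k.shape
    t = torch.arange(T)
    delta = (t[:, None] - t[None, :]).clamp(min=0).float()       # (T, T) time gaps
    mask = (t[:, None] >= t[None, :]).float()                    # causal mask
    decay = torch.exp(-delta[:, :, None] * w) * mask[:, :, None] # (T, T, C)
    weights = decay * torch.exp(k)[None, :, :]                   # (T, T, C)
    num = (weights * v[None, :, :]).sum(dim=1)
    den = weights.sum(dim=1)
    return num / den

# Both modes give the same result (up to floating-point error):
T, C = 8, 4
k, v = torch.randn(T, C), torch.randn(T, C)
w = torch.rand(C) + 0.1
assert torch.allclose(serial_mode(k, v, w), parallel_mode(k, v, w), atol=1e-5)
```

The serial mode also hints at why inference is VRAM-friendly: the state is just two vectors per channel, independent of how long the context is.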
_Arsenie_Boca_ t1_j4rxdt8 wrote
Is there some more detailed description? Would be interesting to read about all these new ideas :)
currentscurrents t1_j4s2n9t wrote
It looks like he goes into a lot more detail on his GitHub.
mrconter1 t1_j4wq1zs wrote
How does the memory scale with the context window size?