Aran_Komatsuzaki

Aran_Komatsuzaki t1_jbkjgzf wrote

I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the latter scores better on the first 1024 tokens. While RWKV performs comparably to Tranformer on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

RWKV has fast decoding speed, but multiquery attention decoding is nearly as fast w/ comparable total memory use, so that's not necessarily what makes RWKV attractive. If you set the context length 100k or so, RWKV would be faster and memory-cheaper, but it doesn't seem that RWKV can utilize most of the context at this range, not to mention that the vanilla attention is also not feasible at this range.

5