Aran_Komatsuzaki t1_jbkjgzf wrote
Reply to [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
I've compared Pythia (GPT-3-style models) w/ context length = 2048 vs. RWKV w/ context length = 4096 at a comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the latter scored better on the first 1024 tokens. While RWKV performs comparably to the Transformer on short-context tasks (e.g. the tasks used in its repo for evaluating RWKV), it may still not be able to replace the Transformer for longer-context tasks (e.g. a typical conversation with ChatGPT).
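(For concreteness, here is a minimal sketch of the kind of position-bucketed perplexity measurement described above, assuming a small Hugging Face Pythia checkpoint and a placeholder text file; the actual checkpoints, data, and evaluation setup of the comparison aren't reproduced here.)

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical small stand-in model; the comparison above used larger Pythia and RWKV checkpoints.
name = "EleutherAI/pythia-160m"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)

text = open("long_document.txt").read()          # assumed: any document longer than 2048 tokens
ids = tok(text, return_tensors="pt").input_ids[:, :2048]

with torch.no_grad():
    logits = model(ids).logits                   # (1, T, vocab)

# Per-token NLL: logits at position t predict the token at position t + 1.
nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Bucket by position: targets inside the first 1024 positions vs. the rest.
ppl_early = nll[:1023].mean().exp().item()
ppl_late = nll[1023:].mean().exp().item()
print(f"ppl (first 1024 tokens): {ppl_early:.2f}, ppl (tokens 1025-2048): {ppl_late:.2f}")
```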
RWKV has fast decoding speed, but multiquery attention decoding is nearly as fast w/ comparable total memory use, so that's not necessarily what makes RWKV attractive. If you set the context length to 100k or so, RWKV would be faster and cheaper in memory, but it doesn't seem that RWKV can utilize most of the context at that range, not to mention that vanilla attention isn't feasible at that range either.
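(A rough back-of-the-envelope sketch of the memory point: the per-sequence KV cache during incremental attention decoding scales with the number of K/V heads and with context length, whereas RWKV carries a fixed-size recurrent state. The layer/head/dimension numbers below are illustrative assumptions, not figures from this comparison.)

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches kept around during incremental decoding (fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed ~6.7B-parameter-scale shapes (not taken from the thread).
layers, heads, head_dim, ctx = 32, 32, 128, 2048

mha = kv_cache_bytes(layers, heads, head_dim, ctx)   # multi-head: one K/V head per query head
mqa = kv_cache_bytes(layers, 1, head_dim, ctx)       # multiquery: a single shared K/V head
print(f"MHA cache: {mha / 2**20:.0f} MiB, MQA cache: {mqa / 2**20:.0f} MiB per sequence")
# RWKV instead keeps a constant-size state per layer, independent of seq_len,
# which is why it wins on memory once the context gets very long (e.g. ~100k tokens).
```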
Aran_Komatsuzaki t1_jbkyegs wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> Thanks for sharing your results. It is being tuned to longer context lengths, current is
I tried the one w/ context length = 4096 for RWKV :)
> Could you clarify - was one of those meant to be former and the other latter?
Sorry for the typo. The latter 'former' is meant to be the 'latter'.