LetterRip t1_jbks0mg wrote
Reply to comment by Aran_Komatsuzaki in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).
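For context, a rough sketch of the kind of position-bucketed perplexity comparison described above: score a long document with a causal LM and compare the mean loss on the first 1024 token positions against the positions after 1024. The model name, input file, and context length below are placeholders, not the exact setup Aran used.

```python
# Minimal sketch of a position-bucketed perplexity check (placeholder model/data).
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-1.4b"  # any HF causal LM checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def bucketed_ppl(text, ctx_len=2048, boundary=1024):
    ids = tok(text, return_tensors="pt").input_ids[:, :ctx_len]
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1, so shift logits/targets by one.
    nll = torch.nn.functional.cross_entropy(
        logits[:, :-1].transpose(1, 2), ids[:, 1:], reduction="none"
    )[0]
    early = nll[: boundary - 1].mean().item()   # tokens 1..1023
    late = nll[boundary - 1 :].mean().item()    # tokens 1024..ctx_len-1
    return math.exp(early), math.exp(late)

early_ppl, late_ppl = bucketed_ppl(open("long_document.txt").read())
print(f"ppl on first 1024 tokens: {early_ppl:.2f}, after 1024: {late_ppl:.2f}")
```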
Thanks for sharing your results. The model is currently being tuned for longer context lengths; the latest checkpoint is
RWKV-4-Pile-14B-20230228-ctx4096-test663.pth
https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main
6k and 8k context versions should follow soon as well, so hopefully you'll see better results on longer contexts.
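For anyone who wants to try it, here is a minimal sketch of pulling that checkpoint from the repo above and loading it with the `rwkv` pip package. The `RWKV(...)` call, the strategy string, and the stateful `forward` loop follow that package's usual documented usage; treat them as assumptions rather than a verified recipe.

```python
# Sketch: download the ctx4096 checkpoint and load it with the `rwkv` package
# (pip install rwkv). API details below are assumptions, not verified here.
from huggingface_hub import hf_hub_download
from rwkv.model import RWKV

ckpt = hf_hub_download(
    repo_id="BlinkDL/rwkv-4-pile-14b",
    filename="RWKV-4-Pile-14B-20230228-ctx4096-test663.pth",
)
# Some versions of the package append ".pth" themselves, so pass the path
# without the suffix; adjust if your version differs.
model = RWKV(model=ckpt.removesuffix(".pth"), strategy="cuda fp16")

# RWKV inference is stateful and token-by-token: feed ids, carry the state.
state = None
for token_id in [510, 3158, 19756]:  # placeholder token ids
    logits, state = model.forward([token_id], state)
```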
> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.
Could you clarify - was one of those meant to be 'former' and the other 'latter'?
Aran_Komatsuzaki t1_jbkyegs wrote
> Thanks for sharing your results. It is being tuned to longer context lengths, current is
I tried the one w/ context length = 4096 for RWKV :)
> Could you clarify - was one of those meant to be 'former' and the other 'latter'?
Sorry for the typo. The second 'former' was meant to be 'latter'.