
LetterRip t1_jbks0mg wrote

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).
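For concreteness, the positional comparison described above can be sketched as follows. This is a minimal illustration, not the evaluation code actually used: it assumes per-token log-probabilities have already been obtained from each model, and `positional_perplexity` is a hypothetical helper name.

```python
import math

def positional_perplexity(token_logprobs, split=1024):
    """Split a sequence of per-token log-probabilities at `split` and
    report perplexity (exp of mean negative log-prob) for each segment,
    mirroring the 'first 1024 tokens vs. the rest' comparison."""
    head = token_logprobs[:split]
    tail = token_logprobs[split:]

    def ppl(lps):
        if not lps:
            return float("nan")
        return math.exp(-sum(lps) / len(lps))

    return ppl(head), ppl(tail)

# Toy check: if every token has probability 0.25, perplexity is 1/0.25 = 4.0
# in both segments.
logprobs = [math.log(0.25)] * 2048
head_ppl, tail_ppl = positional_perplexity(logprobs, split=1024)
```

A lower tail perplexity would indicate the model makes better use of the long context beyond the first 1024 tokens.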

Thanks for sharing your results. It is being tuned to longer context lengths; the current checkpoint is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be 6k and 8k versions as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify: was one of those meant to be 'former' and the other 'latter'?


Aran_Komatsuzaki t1_jbkyegs wrote

> Thanks for sharing your results. It is being tuned to longer context lengths, current is

I tried the one w/ context length = 4096 for RWKV :)

> Could you clarify - was one of those meant to be former and the other late

Sorry for the typo. The latter 'former' is meant to be the 'latter'.
