ChuckSeven t1_iwqey5x wrote
what is the size of the opt model you are comparing with in that table?
Competitive-Rub-1958 t1_iwqmaic wrote
It does need more parameters to compensate (for instance, it has nearly a billion more parameters than GPT-J-6B without substantial performance gains) while still losing out on LAMBADA (I'm ignoring the weighted average, as I don't really see the point of weighting it; it distorts the metrics).
It's an extremely interesting direction, but I fear that as you scale this model the scaling curve might start to flatten out, much like other RNN rewrites/variants. I hope further research can pinpoint the underlying issue and fix it. Till then, best of luck to OP!
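To make the point about weighted averages concrete, here is a toy illustration; the scores and weights below are invented for illustration only, not taken from the RWKV table. Two models with identical unweighted means can be ranked differently once per-task weights are applied.

```python
# Toy illustration: how a weighted average can distort a comparison.
# All numbers below are invented for illustration only.
scores_a = {"LAMBADA": 0.68, "PIQA": 0.74}
scores_b = {"LAMBADA": 0.72, "PIQA": 0.70}
weights  = {"LAMBADA": 0.25, "PIQA": 0.75}   # hypothetical task weights

def unweighted(scores):
    return sum(scores.values()) / len(scores)

def weighted(scores):
    return sum(weights[k] * v for k, v in scores.items())

print(unweighted(scores_a), unweighted(scores_b))  # 0.71 vs 0.71 -> a tie
print(weighted(scores_a), weighted(scores_b))      # 0.725 vs 0.705 -> model A "wins"
```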
bo_peng OP t1_iwua2xh wrote
RWKV 7B is faster than GPT 6B, and RWKV scales great actually :)
If you check the table, RWKV is better than GPT-Neo on everything at 3B (while the smaller RWKV models lag behind on LAMBADA).
But GPT-J uses rotary embeddings and is therefore quite a bit better than GPT-Neo, so I expect RWKV to surpass it at 14B.
Moreover, RWKV 3B becomes stronger after being trained on more tokens, and I am doing the same for the 7B model.
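For context on the rotary-embedding point above, here is a minimal, illustrative sketch of rotary position embeddings (RoPE) in PyTorch; the function name `rotary_embed` and the split of channels into two halves are simplifications for illustration, not code from GPT-J, GPT-Neo, or RWKV.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply a rotary position embedding to x of shape (seq_len, dim).

    Channel i in the first half is paired with channel i + dim//2, and each
    pair is rotated by an angle that grows with the token position, so
    relative offsets become signals the attention dot product can read.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, half) / half))
    # Angle for every (position, pair) combination.
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each pair: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos).
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: queries and keys would be passed through this before attention.
q = torch.randn(128, 64)   # (seq_len, head_dim)
q_rot = rotary_embed(q)
print(q_rot.shape)         # torch.Size([128, 64])
```

The rotation makes attention scores depend on relative token offsets rather than absolute positions, which is commonly cited as one reason GPT-J outperforms GPT-Neo at similar scale.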
bo_peng OP t1_iwqlumm wrote
OPT 6.7B
CKtalon t1_iwqk0b9 wrote
It's written in the 2nd column (params)