Submitted by bo_peng t3_yxt8sa in MachineLearning
Hi everyone. I have finished training RWKV-4 7B (an attention-free RNN LLM) and it can match GPT-J (6B params) performance. Maybe RNN is already all you need :)
These are RWKV BF16 numbers. RWKV 3B beats GPT-Neo 2.7B on every benchmark (the smaller RWKV models only lag behind on LAMBADA). Note that GPT-J uses rotary positional embeddings and is therefore noticeably stronger than GPT-Neo, so I expect RWKV to surpass it when both are at 14B.
Previous discussion: https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/
RWKV has both an RNN mode and a GPT mode. The RNN mode is great for inference; the GPT mode is great for training. Both modes are faster than a usual transformer and save VRAM, because the self-attention mechanism is replaced by simpler (almost linear) formulas. Moreover, the hidden state in RNN mode is tiny, and you can use it as an embedding of the whole context. A rough sketch of the idea is below.
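For intuition, here is a minimal, self-contained sketch of an attention-free recurrent token mix with a constant-size state. This is not the official RWKV code: the weight names, the decay parameterization, and the omission of numerical stabilization are all simplifying assumptions; see the repo below for the real formulas.

```python
# Minimal sketch (assumptions noted above), in the spirit of RWKV's recurrent "WKV" update.
import numpy as np

def recurrent_mix_step(x, state, time_decay, time_first, kw, vw, rw):
    """One RNN-mode step: fold the current token into two running sums
    instead of attending over all previous tokens."""
    k = kw @ x                                   # key-like projection
    v = vw @ x                                   # value-like projection
    r = 1.0 / (1.0 + np.exp(-(rw @ x)))          # receptance (sigmoid gate)

    num, den = state                             # the whole hidden state: two vectors
    boost = np.exp(time_first + k)               # extra weight for the current token
    wkv = (num + boost * v) / (den + boost)      # weighted average of past and current values
    out = r * wkv                                # gated output for this position

    decay = np.exp(-np.exp(time_decay))          # per-channel exponential decay of the past
    new_state = (decay * num + np.exp(k) * v,
                 decay * den + np.exp(k))
    return out, new_state

if __name__ == "__main__":
    d = 8
    rng = np.random.default_rng(0)
    kw, vw, rw = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    time_decay, time_first = np.zeros(d), np.zeros(d)
    state = (np.zeros(d), np.zeros(d))           # constant size, regardless of context length
    for _ in range(16):                          # stream a "context" token by token
        x = rng.standard_normal(d)
        out, state = recurrent_mix_step(x, state, time_decay, time_first, kw, vw, rw)
    # `state` now summarizes everything read so far in O(d) memory,
    # which is why it can double as an embedding of the whole context.
```

In GPT mode the same quantities are computed for all positions of a sequence at once, which is what makes training parallelizable while inference stays a cheap per-token recurrence.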
Github: https://github.com/BlinkDL/RWKV-LM
Checkpt: https://huggingface.co/BlinkDL/rwkv-4-pile-7b
14B in progress (thanks to EleutherAI and Stability). Nice spike-free loss curves:
ChuckSeven t1_iwqey5x wrote
What is the size of the OPT model you are comparing with in that table?