bo_peng OP t1_iwts867 wrote
Reply to comment by Ford_O in [R] RWKV-4 7B release: an attention-free RNN language model matching GPT-J performance (14B training in progress) by bo_peng
RWKV-3 1.5B on A40 (tf32) = a constant 0.015 sec/token regardless of context length, tested with plain PyTorch code (no custom CUDA kernel), GPU utilization 45%, VRAM 7823M
GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (at ctxlen 1000), tested using HF, GPU utilization also 45% (interesting), VRAM 9655M
Moreover, RWKV-4 runs in bf16 and is still faster than 16-bit GPT models.
Training speed: RWKV-4 1.5B BF16 ctxlen1024 = 106K tokens/s on 8xA100 40G.
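The sec/token comparison above can be sketched with a simple autoregressive timing loop. This is a minimal illustration, not the author's actual benchmark code: `TinyRNN` is a hypothetical stand-in model (the real measurement used RWKV-3 1.5B), and the point it demonstrates is that an RNN's per-step cost depends only on its fixed-size state, which is why RWKV's sec/token stays constant while a transformer's grows with context length.

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in for illustration only; the reported numbers
# come from RWKV-3 1.5B, not this toy model.
class TinyRNN(nn.Module):
    def __init__(self, vocab=100, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token, state=None):
        x = self.emb(token)              # (1, 1, dim)
        x, state = self.rnn(x, state)    # recurrent state carries all context
        return self.head(x), state       # logits, updated state

@torch.no_grad()
def sec_per_token(model, n_tokens=200):
    token = torch.zeros(1, 1, dtype=torch.long)
    state = None
    start = time.time()
    for _ in range(n_tokens):
        logits, state = model(token, state)
        token = logits[:, -1:].argmax(-1)  # greedy next-token choice
    return (time.time() - start) / n_tokens

model = TinyRNN().eval()
print(f"{sec_per_token(model):.6f} sec/token")
```

For GPU timing one would additionally call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously.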
Ford_O t1_iwtx6nb wrote
Could you also measure the performance on CPU?
ThePerson654321 t1_iwy0ki4 wrote
So again: what is the disadvantage of using your method?