
farmingvillein t1_jbk6nut wrote

> Based on my comprehension of this model, it appears to offer a distinct set of advantages relative to transformers

What advantages are you referring to, very specifically?

There are theoretical advantages--but it can be a lot of work to prove out that those matter.

There are (potentially) empirical, observed advantages--but there don't seem to be (yet) any claims that are so strong as to suggest a paradigm shift (like Transformers were).

Keep in mind that there is a lot of infrastructure built up to support transformers in an industrial context, which means that even if RWKV shows some small advantage, that advantage may not hold in practice, because of all the extreme optimizations built around transformers at larger organizations (in speed of inference, training, etc.).

The most likely adoption path here would be if multiple papers showed, at smaller scale, consistent advantages for RWKV. No one has done this yet--and the performance metrics provided on the github (https://github.com/BlinkDL/RWKV-LM) certainly don't make such an unequivocal claim on performance.

And providing a rigorous side-by-side comparison with transformers is actually really, really hard--apples to apples comparisons are notoriously tricky, and you of course have to be really cautious about thinking about what "tips and tricks" you allow both architectures to leverage.

Lastly, and this is a fuzzier but IMO relevant point--

The biggest guys are crossing into a point where evaluation is suddenly hard again.

By that, what I mean is that there is broad consensus that our current public evaluation metrics don't do a great job of helping us understand how well these models perform on "more interesting" generative tasks. I think you'll probably see some major improvements around eval/benchmark management in the next year or so (and certainly, internally, the big guys have invested a lot here)--but for now, it is harder to pick up a new architecture/model and understand its capabilities on the "more interesting" tasks that your GPT-4s & Bards of the world are trying to demonstrate. This makes it harder to prove and vet progress on smaller models, which of course makes scaling up more risky.

6

ThePerson654321 OP t1_jbk8kxy wrote

I'm basically just referring to the claims by the developer. He makes it sound extraordinary:

> best of RNN and transformer, great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

> Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.
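
If I understand that claim correctly, "only matrix-vector multiplications" means each generated token just updates a fixed-size state, so per-token cost doesn't grow with history. A rough sketch of what that looks like (generic RNN-mode step in plain NumPy with made-up shapes, not RWKV's actual kernels):

```python
# Minimal sketch of RNN-mode generation (illustrative only, not RWKV's real code).
import numpy as np

d = 512                               # hidden size (hypothetical)
Wx = np.random.randn(d, d) * 0.02     # input projection
Wh = np.random.randn(d, d) * 0.02     # recurrent projection

def rnn_step(x_t, h_prev):
    # One token of generation: two matrix-VECTOR products, cost O(d^2),
    # regardless of how many tokens came before.
    return np.tanh(Wx @ x_t + Wh @ h_prev)

h = np.zeros(d)
for _ in range(1000):                 # sequence length doesn't change per-step cost
    x = np.random.randn(d)            # stand-in for an embedded token
    h = rnn_step(x, h)
```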

The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?


I definitely agree that there might be an incompatibility with the already existing transformer-specific infrastructure.

But thanks for your answer. It might be one or more of the following:

  1. The larger organizations haven't noticed/cared about it yet
  2. I overestimate how good it is (from the developer's description)
  3. It has some unknown flaw that's not obvious to me and not stated in the repository's README.
  4. All the existing infrastructure is tailored for transformers and is not compatible with RWKV

At least we'll see in time.

0

farmingvillein t1_jbkwkgl wrote

> most extraordinary claim I got stuck on was "infinite" ctx_len.

All RNNs have that capability, on paper. But the question is how well the model actually remembers and utilizes things that happened a long time ago (e.g., things beyond the window that a transformer has). In simpler RNN models, the answer is usually "not very".

Which doesn't mean that there can't be real upside here--just that it is not a clear slam-dunk, and that it has not been well-studied/ablated. And obviously there has been a lot of work in extending transformer windows, too.
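
To illustrate with a toy example (a generic decaying recurrent state, nothing to do with RWKV specifically): information injected early gets geometrically squashed as the state keeps updating, which is the usual failure mode.

```python
# Toy illustration of why "infinite context on paper" often means "not very
# much in practice": with a simple decaying state, the trace of an early
# token shrinks geometrically with distance.
import numpy as np

decay = 0.99                  # per-step retention (hypothetical)
signal = np.ones(8)           # pretend this is information from token 0
state = signal.copy()

for t in range(1, 5001):
    state = decay * state     # new-token contributions omitted for clarity
    if t in (100, 1000, 5000):
        print(t, np.linalg.norm(state) / np.linalg.norm(signal))
# ~0.37 after 100 steps, ~4e-5 after 1000, ~1e-22 after 5000.
```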

5

LetterRip t1_jbkmk5e wrote

> He makes it sound extraordinary

The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.

> The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?

Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length:

> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.

> BlinkDL mentioned that when training in GPT mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine-tune it properly. (Unclear right now how to do so.)

https://github.com/ArEnSc/Production-RWKV
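
Roughly, the recurrence looks something like the sketch below (a simplified AFT-style exponentially-decayed weighted average, not BlinkDL's actual numerically-stabilized kernel, which also has a "bonus" term): the state is just two fixed-size accumulators, so there is no hard context window, but the per-channel decay is what ends up bounding the effective range.

```python
# Simplified sketch of an AFT/RWKV-style recurrence (illustrative only).
import numpy as np

d = 64
w = np.full(d, 0.05)                             # per-channel decay rate (hypothetical)

def wkv_step(num, den, k_t, v_t):
    """One token: exponentially-decayed weighted average with O(d) state."""
    out = (num + np.exp(k_t) * v_t) / (den + np.exp(k_t) + 1e-9)
    num = np.exp(-w) * num + np.exp(k_t) * v_t   # decayed numerator accumulator
    den = np.exp(-w) * den + np.exp(k_t)         # decayed denominator accumulator
    return out, num, den

num, den = np.zeros(d), np.zeros(d)
for _ in range(10):                              # state never grows with sequence length
    k, v = np.random.randn(d) * 0.1, np.random.randn(d)
    out, num, den = wkv_step(num, den, k, v)
```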

3