
MysteryInc152 t1_j8oj9qx wrote

Fantastic work. Thanks for doing this. Good luck scaling to 24b. I hope more catch on because the lack of a limited context length is a game changer.

17

farmingvillein t1_j8p269l wrote

> I hope more catch on because the lack of a limited context length is a game changer.

I'd be cautious about concluding this, without more testing.

RNNs, in some theoretical sense, support infinite context more easily than O(N²) transformers; in practice, their effective "context window" often looks little different from a reasonable transformer's when we measure performance on long sequences.
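To make "effective context window" concrete, here is one crude way to measure it (a sketch; the helper names and the flattening heuristic are illustrative, not taken from any paper cited in this thread): bucket per-token losses by position and find where additional context stops reducing loss.

```python
# Sketch (hypothetical helpers): estimate an "effective context window" by
# bucketing per-token losses by position and finding where the loss curve
# flattens, i.e. where extra history stops helping.

def loss_by_position(per_token_losses, bucket=50):
    """Average loss over all sequences, grouped into position buckets."""
    buckets = {}
    for seq in per_token_losses:
        for pos, loss in enumerate(seq):
            buckets.setdefault(pos // bucket, []).append(loss)
    return {b * bucket: sum(v) / len(v) for b, v in sorted(buckets.items())}

def effective_context(curve, tol=0.01):
    """First position at which the curve flattens (per-bucket gain < tol)."""
    positions = sorted(curve)
    for prev, cur in zip(positions, positions[1:]):
        if curve[prev] - curve[cur] < tol:
            return cur
    return positions[-1]
```

A model whose curve flattens by position ~200 is making no real use of history past that point, whatever its nominal (or "infinite") context length claims.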

30

maizeq t1_j8p3f1s wrote

Any papers I can refer to for that last paragraph? I expect it is true but would love to see some empirical work.

8

farmingvillein t1_j8p7qa8 wrote

Any of the papers that address building NLP for long contexts will tend to have a relevant related works section. E.g., https://arxiv.org/pdf/2109.00301.pdf.

(The one qualifier here is that, at "modern" scale, RNNs have not really been well-tested (since people tend to just use...transformers). So, maaaybe they are actually simply superior. Evidence so far says "doubtful", however (at least for more vanilla implementations).)

12

gwern t1_j8ph8as wrote

I don't think the Related Works section of that paper provides any useful references. It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

5

farmingvillein t1_j8piz80 wrote

Not clear to me what you are looking for here.

> It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

The very first reference I pulled, Graves 2014, specifically compares w/ and w/o memory.

Or Dai et al., which tries to compare against various RNN-style baselines with similar parameter counts.

Perhaps we're talking past each other?

2

gwern t1_j8psc8m wrote

> Not clear to me what you are looking for here.

The question asked was pretty clear, to justify the statement:

>> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

Simply comparing RNNs with and without memory doesn't tell you anything about how fast the memory fades out, or whether it ever winds up being bigger than a Transformer's. For example, you could construct a toy problem which requires memory reaching back exactly 1 state, and show that an arch with any memory outperforms a memory-less arch; this would obviously tell you nothing of interest like 'this memory makes little use of history further back than 50 steps and none past 200 (and so is easily outperformed by history-stacking like a Transformer)'. Nor does comparing a Transformer with a history of, say, l=500 against an RNN, and the Transformer winning, tell you anything about why the RNN lost - ok, the Transformer did better, great, we have a superior new tool, but why? Maybe it has similar memory problems and is just way better at the modeling part, or memorizes better, or something entirely different.

Likewise, unless you are comparing against RNN baselines which somehow have known hard history constraints, such comparisons cannot tell you anything useful about how fast the effective memory fades out, how the accuracy of the memory is 'distributed' over the effective context window, whether there are hard cutoffs, whether the RNN is basically only using the last few states, and so on.
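A baseline with a "known hard history constraint" can be built as a toy task (a sketch; the task construction and names here are illustrative, not from any paper cited in this thread): make the target at each step the input from exactly `lag` steps back, so solving it requires carrying information back exactly that far.

```python
import random

def k_back_task(length, lag, vocab=10, seed=0):
    """Toy sequence task: target[t] = input[t - lag], so any model that
    solves it must retain at least `lag` steps of history."""
    rng = random.Random(seed)
    inputs = [rng.randrange(vocab) for _ in range(length)]
    # No target is defined for the first `lag` positions (-1 = ignore index).
    targets = [-1] * lag + inputs[:-lag]
    return inputs, targets
```

Sweeping `lag` and watching where an RNN's accuracy collapses gives a direct read on how far its effective memory actually reaches, independent of any downstream language-modeling score.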

In contrast, a Transformer has direct shortcut access to the history (we don't need any paper to know this; literally any GPT output exhibiting coherent long-range references past a few paragraphs demonstrates it directly). So if you show that an RNN uses primarily the past 50 steps and simply 'fades out' completely past 200 steps, and so the 'infinite history' is meaningless in practice, well, we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show), so a direct comparison is otiose. Directly examining an RNN's understanding of its history, as my two references do, is much better than some higher-level performance comparison, which is what most of the papers in that Related Works section do; direct performance comparisons are great, but they do not ablate where the problem is on the RNN's end. (Although if I really needed one, I would prefer to point at the RNN vs Transformer scaling laws in context window anyway, like Kaplan et al 2020 IIRC, to show that the Transformers are making good use of it, not merely some sort of better-than-RNN use or gains elsewhere.)

4

farmingvillein t1_j8qipd4 wrote

Let's think step by step:

You:

> I don't think the Related Works section of that paper provides any useful references.

Your own response to the question that was posed:

> https://arxiv.org/abs/1805.04623
>
> https://arxiv.org/abs/1702.04521

There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":

> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement

> Simply comparing RNNs with and RNNs without memory doesn't tell you anything about how fast the memory fades out and that it never winds up being bigger than a Transformer

I never said this, so I'm not sure what your argument is.

> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)

Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) make any reference to Transformers.

If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) again against a strawman. Re-read what I actually wrote:

> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

My statement here is an empirical one about performance--which, among other things, is why I reference Dai et al., who (among others!) do a fairly extensive breakdown of empirical performance differences of RNN- versus transformer-type architectures on long text sequences.

The whole point is that the OP said RNNs were attractive because of the theoretical infinite context--but my response was that 1) we don't really see that in practice when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) data sets that should theoretically reward it. And both of these points are covered if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).

(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)

−1

gwern t1_j8s2du5 wrote

> There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest, and no, I did not recurse down n deep in a breadth-first search. I read the Related Works of that paper, as I said ("I don't think the Related Works section of that paper"), noted that they were a bunch of memory-related papers which might or might not cite the actually relevant research I had in mind, but life was too short to queue up a dozen papers just to check their RW when I already knew some useful ones. Giving someone a random reference and telling them to manually crawl the literature is not helpful. In contrast, the two references I provided directly bore on the question, they didn't maybe cite papers which might bury something relevant in a footnote or cite papers which might someday answer the question...

> I never said this, so I'm not sure what your argument is.

I was pointing out why it was irrelevant to bring up a paper which "compares w/ and w/o memory." Mildly interesting but such a comparison cannot show what was asked about the effective memory of RNNs. Of course it is better to have (any) memory than not.

> which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.

Dai would in fact have been useful, had you referenced it in your original comment. Unless you mean 'vaguely gestured in the direction of a paper which has 50+ references, with 35 in the RW section alone, any of which could have been relevant, where the relevant benchmarking of Dai was not highlighted in the paper to begin with, nor is the relevant context work mentioned in the abstract of Dai but buried at the end of the paper (with the RNN results hidden inside a table), so you just have to know it's already there', and claimed you 'referenced it'. Then sure, yeah, that was a useful reference. Thanks for the input.

> If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) again against a strawman.

It's not a strawman. It's not obvious a priori that Transformers would work so much better or that RNN histories fade out so fast, which is why it had to be empirically established that the history fades out completely, as opposed to any of the other reasons that RNNs could underperform (maybe they have history but can't learn a good algorithm exploiting their memory, say, or they could but they are poorly optimized - there are so many ways for NNs to break) and people were surprised by how well Transformers work. It is completely understandable that OP would expect RNN history to work better than it does, and would want some hard citeable evidence that it works so badly that Transformers, with their apparently brutal hard cutoff, wind up having much closer to 'infinite context' than RNNs themselves.

Thus, it's useful to provide references showing that. (Not references to unspecified references which may or may not show that - gl.)

1

farmingvillein t1_j8s7ygo wrote

This...is pretty astounding. Just have the grace to admit you were wrong, and move on.

> Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest

Then how can you possibly say:

> I don't think the Related Works section of that paper provides any useful references.

?

This is hardcore trolling. You can, and frequently do, do better than this.

You are literally pushing posts that are factually incorrect, and that you either know are factually incorrect, or are too lazy to validate either way.

This is the type of thing which blows up post quality in this sub.

> Giving someone a random reference and telling them to manually crawl the literature is not helpful.

This...is ridiculous. This is--traditionally--a very academic-friendly sub. This is how research works. "Here is where you can start a literature review on a bundle of related papers" is an extremely classic response which is generally considered helpful to complex and nuanced questions.

And the underlying issue is actually very complex, as evidenced in part by the fact that your references do not actually answer the question. "Go read related works" can be obnoxious when there are one or two papers that do answer the question--but that is not the case here.

> In contrast, the two references I provided directly bore on the question

No they did not. They did not touch at all upon Transformers versus RNNs, which was the question. You've chosen to cherry-pick one slice of the problem and declare victory.

> It's not a strawman.

You don't seem to understand what a strawman is. Strawman:

> an intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent's real argument.

I was not making this argument. You were making this argument. QED, this is a strawman.

2

bo_peng OP t1_j8qhiyk wrote

RWKV is the exception. When you look at loss against token position, it is comparable with transformers.

You can tell that from the generation results too.

3

farmingvillein t1_j8qj1u7 wrote

> RWKV is the exception. When you look at loss against token position, it is comparable with transformers.

Can you link to what you are referring to? If I missed it in the OP post, my apologies.

2

csreid t1_j8p5z30 wrote

But they theoretically support infinite context length. Getting it is a problem to be solved, not a fundamental incompatibility like it is with transformers.

2

farmingvillein t1_j8p7lci wrote

Neither really work for super long contexts, so it is kind of a moot point.

Both--empirically--end up with bolt-on approaches to enhance memory over very long contexts, so it isn't really clear (a priori) that the RNN has a true advantage here.

8