Submitted by fripperML t3_125ejg0 in MachineLearning
Haycart t1_je4923c wrote
>Yes, ChatGPT is doing much more than querying text! It is not just a query engine on a giant corpus of text. … Duh! I do not think you should only think of ChatGPT as a query engine on a giant corpus of text. There can be a lot of value in reasoning about ChatGPT anthropomorphically or in other ways. RLHF also complicates the story, as over time it weighs responses away from the initial training data. But “query engine on a giant corpus of text” should be a non-zero part of your mental model because, without it, you cannot explain many of the things ChatGPT does.
The author seems to present a bizarre dichotomy: either you think of ChatGPT as a query engine, or you think of it in magical/mystical/anthropomorphic terms.
(They also touch on viewing ChatGPT as a function on the space of "billion dimensional" embeddings. This is closer to the mark but seems to conflate the model's parameter count with the dimensionality of its latent space, which doesn't exactly inspire confidence in the author's level of understanding.)
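To put rough numbers on that distinction, here's a back-of-the-envelope sketch (figures are approximate, loosely in the spirit of the published GPT-3 175B configuration; the exact values don't matter for the point):

```python
# Rough sketch of why "billion-dimensional embeddings" conflates two different numbers.
# Figures are approximate, loosely based on the published GPT-3 175B configuration.
d_model = 12288      # dimensionality of the latent/embedding space (~1e4, not 1e9)
n_layers = 96
d_ff = 4 * d_model   # feed-forward inner dimension

# Parameters per transformer block: attention (4 * d_model^2 for the Q, K, V, output
# projections) + MLP (2 * d_model * d_ff), ignoring biases, layer norms, and embeddings.
params_per_block = 4 * d_model**2 + 2 * d_model * d_ff
total_params = n_layers * params_per_block

print(f"latent dimension: {d_model:,}")       # ~12 thousand
print(f"parameter count:  {total_params:,}")  # ~174 billion, i.e. the headline number
```

The space the model actually operates in has tens of thousands of dimensions; the billions are weights, not coordinates.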
Why not just think of ChatGPT as what it is--a very large transformer?
The fact that a model like ChatGPT is able to do what it does is not at all surprising, IMO, when you consider the following facts:
- Transformers (and neural networks in general) are universal approximators. A sufficiently large neural network can approximate any function to arbitrary precision (with a few minor caveats; see the toy construction sketched just after this list).
- Neural networks trained with stochastic gradient descent benefit from implicit regularization -- SGD naturally tends to seek out simple solutions that generalize well. Furthermore, larger neural networks appear to generalize better than smaller ones.
- The recent GPTs have been trained on a non-trivial fraction of the entire internet's text content.
- Text on the internet (and language data in general) arises from human beings interacting with the world--reasoning, thinking, and emoting about those interactions--and attempting to communicate the outcome of this process to one another.
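To make the first bullet less abstract, here's a toy sketch that *constructs* (rather than trains) a one-hidden-layer ReLU network approximating sin(x) on an interval; widening the network shrinks the worst-case error. This is just the classic piecewise-linear construction, nothing transformer-specific, and obviously not how GPT works:

```python
import numpy as np

def relu_interpolant(x, knots, values):
    """One-hidden-layer ReLU network realizing the piecewise-linear interpolant of
    (knots, values): f(x) = values[0] + sum_i c_i * relu(x - knots[i])."""
    slopes = np.diff(values) / np.diff(knots)
    coeffs = np.concatenate([[slopes[0]], np.diff(slopes)])  # one coefficient per hidden unit
    return values[0] + sum(c * np.maximum(x - k, 0.0) for c, k in zip(coeffs, knots[:-1]))

xs = np.linspace(-np.pi, np.pi, 10_000)
for width in (8, 64, 512):  # number of hidden ReLU units
    knots = np.linspace(-np.pi, np.pi, width + 1)
    err = np.max(np.abs(relu_interpolant(xs, knots, np.sin(knots)) - np.sin(xs)))
    print(f"width={width:4d}  max error={err:.2e}")  # shrinks roughly like 1/width^2
```

The universal approximation theorems are, very loosely, a much more general version of this trick.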
Is it really crazy to imagine that the simplest possible function capable of fitting a dataset as vast as ChatGPT's might resemble the function that produced it? A function that subsumes, among other things, human creativity and reasoning?
In another world, GPT-3 or 4 might have turned out to be incapable of approximating that function to any notable degree of fidelity. But even then, it wouldn't be outlandish to imagine that one of the later members of the GPT family could eventually succeed.
sdmat t1_je4bgwh wrote
Exactly, it's bizarre to point at revealing failure cases of a universal approximator and then claim that fixing those failure cases in later versions would be irrelevant.
It's entirely possible that GPT-3 only does interpolation and fails horribly out of domain, and that GPT-5 will infer the laws of nature, language, psychology, logic, etc. and be able to apply them to novel material.
It certainly looks like GPT-4 is somewhere in between.
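As a hedged toy picture of what "only interpolation" vs. out-of-domain failure looks like (nothing GPT-specific here, just a least-squares polynomial fit standing in for any flexible function approximator):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-np.pi, np.pi, 200)           # training domain
coeffs = np.polyfit(x_train, np.sin(x_train), 9)    # degree-9 polynomial fit of sin(x)

x_in = np.linspace(-np.pi, np.pi, 1000)             # in-domain: interpolation
x_out = np.linspace(2 * np.pi, 3 * np.pi, 1000)     # out-of-domain: extrapolation
print("in-domain max error:     ", np.max(np.abs(np.polyval(coeffs, x_in) - np.sin(x_in))))
print("out-of-domain max error: ", np.max(np.abs(np.polyval(coeffs, x_out) - np.sin(x_out))))
```

Inside the training interval the fit looks excellent; a few periods away it is wildly wrong. Whether later GPTs genuinely move past that regime is exactly the open question.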
ChuckSeven t1_je55o02 wrote
The Transformer is not a universal function approximator. This is shown simply by the fact that it cannot process arbitrarily long input, due to its finite context window.
Your conclusion is not at all obvious or likely given your facts. It seems like hindsight, given the strong performance of large models.
It's hard to think of chatgpt as a very large transformer ... because we don't know how to think about very large transformers.
Haycart t1_je6grih wrote
>The Transformer is not a universal function approximator. This is shown simply by the fact that it cannot process arbitrarily long input, due to its finite context window.
We can be more specific, then: the transformer is a universal function approximator* on the space of sequences that fit within its context. I don't think this distinction is necessarily relevant to the point I'm making, though.
*again with caveats regarding continuity etc.
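To be concrete about what that caveat-laden claim means, here's a rough paraphrase of the known sequence-to-sequence approximation results (not a precise statement; the actual theorems add technical conditions on the norm, positional encodings, etc.): fix the context length n and embedding dimension d; then for any continuous function f from a compact set K ⊂ R^(n×d) to R^(n×d), and any ε > 0, there exists a transformer g with

```latex
d_p(f, g) \;=\; \Bigl( \int_{K} \lVert f(X) - g(X) \rVert_p^p \, dX \Bigr)^{1/p} \;<\; \varepsilon,
\qquad 1 \le p < \infty .
```

Fixed sequence length, compactness, and continuity are exactly the caveats; within them, "can't do X" arguments have to point at the training objective or the data, not the architecture.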
>Your conclusion is not at all obvious or likely given your facts. It seems like hindsight, given the strong performance of large models.
Guilty as charged, regarding hindsight. I won't claim to have predicted GPT-3's performance a priori. That said, my point was never that the strong performance we've observed from recent LLMs was obvious or likely--only that it shouldn't be surprising. And, in particular, it should not be surprising that a GPT model (not necessarily GPT-3 or 4) trained on a language modeling task would have the abilities we've seen. Everything we've seen falls well within the bounds of what transformers are theoretically capable of doing.
There are, of course, aspects of the current situation specifically that you can be surprised about. Maybe you're surprised that 100 billion-ish parameters is enough, or that the current volume of training data was sufficient. My argument is mostly aimed at claims along the lines of "GPT-n can't do X because transformers lack capability Y" or "GPT-n can't do X because it is only trained to model language".