
Zermelane t1_iwvefy4 wrote

This question really deserves a great answer, and I've tried to write an okay one over two days now, but there's an intuition here that I don't really know how to express. Or it might just be wrong; I'm not an ML researcher. But even if it's just disjointed chunks of an argument, here goes anyway:

You can run GPT-2 on a weak GPU and it'll do a great job dealing with language and text as such. On the one hand, that's a non-obvious accomplishment in its own right (see nostalgebraist on GPT-2 being able to write for more on that); but on the other hand, well, when was the last time you actually used GPT-2 for anything?
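(For concreteness, here's roughly what "running GPT-2 on a weak GPU" looks like in practice. This is a minimal sketch assuming the Hugging Face transformers library; the 124M-parameter "gpt2" checkpoint runs fine even on CPU, and the prompt is just an example.)

```python
# Minimal sketch: text generation with the smallest GPT-2 checkpoint (~124M params),
# which fits comfortably on a weak GPU or even a CPU.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # "gpt2" = 124M-parameter model

out = generator(
    "The reason text models keep getting bigger is",  # example prompt
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
)
print(out[0]["generated_text"])
```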

And the reason you don't do that is... text models stopped being about text a long time ago. By far most of what they model is just, well, everything else: stories, logic, physical intuition, theory of mind, and so on. GPT-2 can do language, and language is pretty straightforward, but all that other stuff is general intelligence, and general intelligence is very, very hard.

But if you're going to do general intelligence, text is a really great modality. It comes pre-processed by language evolution to carry a nice, even, high rate of information, so if you compress it just a little, you get a lot of structure and meaning in a tiny number of input bits. That in turn means a model can focus right away on the hard parts, spend a roughly even amount of computation on everything, and still not leave much performance on the table.
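One rough way to see that density: run a sentence through GPT-2's own byte-pair tokenizer and look at how few tokens come out, each one already a meaningful chunk the model can spend its compute on. A sketch assuming the Hugging Face transformers library; the sentence is just an example.

```python
# Rough illustration of how dense text is as an input modality:
# a whole sentence becomes a handful of tokens, each one a meaningful unit.
# Assumes the Hugging Face `transformers` library; the sentence is just an example.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
sentence = "General intelligence is very, very hard."
ids = tok.encode(sentence)

print(f"{len(sentence.encode('utf-8'))} bytes of raw UTF-8 text")
print(f"{len(ids)} GPT-2 tokens: {tok.convert_ids_to_tokens(ids)}")
```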

Image models, on the other hand, model far less: just the visual universe of pictures on the internet, no big deal. And you probably aren't trying to get them to pull off anything like the feats of reasoning you expect from language models. Hence, they can seemingly do a lot with little. I've seen someone get Stable Diffusion to outpaint the right side of a blackboard with "1+1=" written on the left side, and I think it did manage to put in a 2, but that's probably about the extent of reasoning people expect from image models right now.
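For what it's worth, that blackboard trick would look roughly like this as code. This is a hedged sketch assuming the Hugging Face diffusers inpainting pipeline; the checkpoint name, image file, and mask file are assumptions, and there's no guarantee the model actually writes a 2.

```python
# Hypothetical sketch of the blackboard experiment: outpaint the right half of an
# image of a blackboard with "1+1=" written on the left, using an inpainting
# pipeline. The checkpoint name, image file, and mask file are assumptions, and
# the model is not guaranteed to complete the sum.
# Assumes the Hugging Face `diffusers` library and a CUDA GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init = Image.open("blackboard_1plus1.png").convert("RGB")  # hypothetical input image
mask = Image.open("right_half_mask.png").convert("L")      # white = region to fill in

result = pipe(
    prompt="a blackboard with a simple chalk equation written on it",
    image=init,
    mask_image=mask,
).images[0]
result.save("outpainted.png")
```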

Audio I don't really have much of a handle on. One issue with audio models is that if you want to represent most of the audio you find online well, you pretty much need to be a great language model as well, considering how much audio is speech or song. But at the same time, audio is a far heavier way to represent that language than text is, so it's far harder to learn all of language from audio.
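A quick back-of-the-envelope on how much heavier, assuming ballpark figures (~150 spoken words per minute, ~6 bytes per word of text, 16 kHz 16-bit mono PCM audio): the same minute of speech is on the order of a couple of thousand times more bytes as audio than as text.

```python
# Back-of-the-envelope comparison of how heavy audio is versus text for the
# same spoken content. All rates are ballpark assumptions: ~150 spoken words
# per minute, ~6 bytes per word of text, 16 kHz mono 16-bit PCM audio.
WORDS_PER_MINUTE = 150
BYTES_PER_WORD = 6          # average English word plus a space, UTF-8
SAMPLE_RATE = 16_000        # Hz, typical for speech models
BYTES_PER_SAMPLE = 2        # 16-bit PCM, mono

text_bytes = WORDS_PER_MINUTE * BYTES_PER_WORD
audio_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * 60

print(f"one minute of speech as text:  ~{text_bytes:,} bytes")
print(f"one minute of speech as audio: ~{audio_bytes:,} bytes")
print(f"audio is roughly {audio_bytes / text_bytes:,.0f}x heavier")
```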

2

Jordan117 OP t1_ix2r4p3 wrote

I'm no expert either, but this definitely felt like the sort of question that sounds basic yet hits on some fundamental, abstract "theory of information" kind of complexity. It's why I find it so fascinating -- there's something really mysterious and compelling going on in these models that even the researchers themselves are struggling to unravel. Thanks for taking the time!

2