
farmingvillein t1_jadqg1l wrote

You're missing the point here, or I wasn't clear--the question isn't whether performance will improve with more params (and, potentially, more data); no doubt there.

The question is whether a model trained at scale on text & images will outperform a model trained at scale solely on text, in the text-only domain (or, similarly, in the image-only domain).

To date, all* of the public research on multimodal models (and Kosmos is no different) has shown, at best, multimodal models performing roughly on par with unimodal variants in unimodal domains. And often they are a shade worse (like Kosmos).

(*=unless you count code+natural language.)
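
To make the comparison concrete, here's a rough sketch of the kind of apples-to-apples eval I mean--evaluate a text-only LM and a multimodal LM on the same text-only benchmark and compare perplexities. Model names and the harness are placeholders, not any particular paper's setup:

    # Hypothetical sketch: score a text-only model and a multimodal model on
    # the *same* text-only benchmark. Checkpoint names are placeholders; a
    # real multimodal model would need its own loading/eval code.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def text_only_perplexity(model_name: str, texts: list[str]) -> float:
        """Average perplexity of a causal LM over text-only examples."""
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        losses = []
        with torch.no_grad():
            for t in texts:
                enc = tok(t, return_tensors="pt")
                out = model(**enc, labels=enc["input_ids"])
                losses.append(out.loss.item())
        return math.exp(sum(losses) / len(losses))

    held_out_text = ["Some held-out, text-only evaluation example."]
    ppl_unimodal = text_only_perplexity("placeholder/text-only-lm", held_out_text)
    ppl_multimodal = text_only_perplexity("placeholder/multimodal-lm", held_out_text)
    print(f"text-only: {ppl_unimodal:.2f}  multimodal: {ppl_multimodal:.2f}")

So far, published results have the second number at best matching the first.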

The holy grail, of course, is that the two help one another, so that your multimodal variant outperforms the unimodal variants on unimodal tasks--e.g., GPT-* gets better at talking to you because it has ingested all of the YouTube videos in the world.

If you can demonstrate that (and it certainly makes intuitive human sense that this could/should be true), then of course there is a giant truckload of image (including video!) and audio data you can slam into your text models to make text-based scenarios better (and similarly for images, etc.). (And it also more plausibly suggests that massive amounts of synthetic world exploration data could be accretive, too...)

There is a bunch of research (https://arxiv.org/abs/2301.03728 being one of the most exciting) suggesting that this can occur with enough data/params, but no one has publicly demonstrated it. (And it'd surprise no one, probably, if this were part of GPT-4's or Gato-2's mix.)

40

deliciously_methodic t1_jad1h8m wrote

What does “scale up” mean in this context? I use “scale up” vs. “scale out” in an ML hardware context to mean “making a CPU/GPU more powerful” vs. “adding more GPUs,” but I’m not clear whether that analogy carries over to AI models (scaling up and out), or if you simply mean “the model will get bigger.”
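
Roughly what I mean by the hardware-side distinction, in PyTorch (toy sizes, purely illustrative):

    # "Scale up" = a bigger model on one device; "scale out" = the same model
    # replicated across more GPUs. Layer counts/widths here are arbitrary.
    import torch
    import torch.nn as nn

    def make_mlp(hidden: int, layers: int) -> nn.Module:
        blocks = []
        for _ in range(layers):
            blocks += [nn.Linear(hidden, hidden), nn.ReLU()]
        return nn.Sequential(*blocks)

    # Scale up: more parameters per device.
    small = make_mlp(hidden=1024, layers=12)
    big = make_mlp(hidden=4096, layers=48)

    # Scale out: keep the model fixed, add devices (data parallelism).
    if torch.cuda.device_count() > 1:
        big = nn.DataParallel(big)  # or DistributedDataParallel in practice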

−3

farmingvillein t1_jadt897 wrote

FWIW, I was trying to make a more subtle point than OP's response--see my other reply.

4