currentscurrents t1_jdft0hp wrote

>They seem to refer to this model as text-only, contradicting the known fact that GPT-4 is multi-modal.

I noticed this in the original paper as well.

This probably means that they implemented multimodality the same way PaLM-E did: starting with a pretrained LLM.
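For anyone unfamiliar with the PaLM-E recipe, the core idea is to keep the pretrained LLM (mostly) frozen and project vision-encoder features into its token-embedding space, so images become "soft tokens" in the input sequence. A minimal sketch below; all module names and dimensions are illustrative, not from the GPT-4 report.

```python
# Sketch of a PaLM-E-style adapter: project pooled vision features
# into a fixed number of soft tokens in the LLM's embedding space.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_embed_dim=4096, num_tokens=32):
        super().__init__()
        # One linear map producing num_tokens embeddings per image
        self.proj = nn.Linear(vision_dim, num_tokens * llm_embed_dim)
        self.num_tokens = num_tokens
        self.llm_embed_dim = llm_embed_dim

    def forward(self, image_features):  # (batch, vision_dim)
        tokens = self.proj(image_features)
        return tokens.view(-1, self.num_tokens, self.llm_embed_dim)

# Usage idea: prepend the projected image tokens to the text embeddings
# and run the combined sequence through the (frozen) LLM as usual:
#   image_tokens = adapter(vision_encoder(pixels))
#   inputs = torch.cat([image_tokens, text_embeddings], dim=1)
```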

57

was_der_Fall_ist t1_jdgmd2t wrote

As far as I understand, that’s exactly what they did. That’s why the public version of GPT-4 is text-only so far. The vision part came after.

16

JohnFatherJohn t1_jdje7cl wrote

Perhaps they're saying that because it can only output text; multimodality is limited to images and text as inputs.

2

SatoshiNotMe t1_jdke4cu wrote

How do you input images to GPT-4? Via the API?

1

JohnFatherJohn t1_jdkik7r wrote

It's not available to the public yet; access is restricted to specific groups that are conducting research.
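For illustration, a request would presumably look something like this once access opens up. This is a hypothetical sketch assuming an OpenAI-style chat API that accepts image URLs alongside text; the model name and field names are assumptions, not documented GPT-4 API behavior.

```python
# Hypothetical image + text request against an OpenAI-style chat API.
# Field names and model name are illustrative assumptions.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",  # assumes a vision-enabled variant once it's public
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```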

1