Submitted by Balance- t3_120guce in MachineLearning
GPT-4 is a multimodal model: it accepts image and text inputs and emits text outputs. And I just realised: you can layer this over any application, or even combinations of them. You could make a screenshot tool in which you can ask questions about whatever is on screen.
This makes literally any current software with a GUI machine-interpretable. A multimodal language model can look at the exact same interface that you are looking at, so you no longer need advanced integrations.
Of course, a custom integration will almost always be better, since it has direct access to the underlying data and commands, but the fact that this can work immediately on any program is just insane.
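To make it concrete, here's a rough sketch of what a "screenshot and ask" tool could look like. It assumes the OpenAI Python SDK's vision-style chat format; the model name ("gpt-4o") and client setup are placeholders, so treat it as illustrative rather than tied to any particular release:

```python
# Sketch: screenshot the screen and ask a multimodal model a question about it.
# The model name and request format are assumptions; adapt to whatever
# vision-capable chat API you actually have access to.
import base64
import io

from PIL import ImageGrab  # pip install pillow
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_about_screen(question: str, model: str = "gpt-4o") -> str:
    # Grab the current screen and encode it as a base64 PNG data URL.
    screenshot = ImageGrab.grab()
    buf = io.BytesIO()
    screenshot.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    # Send the question plus the screenshot to the model.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_about_screen("What application is open, and what is it showing?"))
```

That's the whole "integration": no app-specific API, just pixels in and text out.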
Just a thought I wanted to share, curious what everybody thinks.
BinarySplit t1_jdh9zu6 wrote
GPT-4 is potentially missing a vital feature for taking this one step further: visual grounding, the ability to say where inside an image a specific element is. E.g., if the model wants to click a button, what X,Y position on the screen does that translate to?
Other MLLMs have it though, e.g. One-For-All. I guess it's only a matter of time before we can get MLLMs to provide a layer of automation over desktop applications...
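For illustration, here's a minimal sketch of what that automation layer could look like once grounding is in place. The `get_bounding_box` helper is hypothetical, a stand-in for whatever grounding model you'd plug in (OFA-style or otherwise); `pyautogui` does the actual clicking:

```python
# Sketch: turn a grounding model's bounding box into a real mouse click.
# get_bounding_box() is a hypothetical placeholder for a visual-grounding model.
import pyautogui  # pip install pyautogui


def get_bounding_box(instruction: str) -> tuple[int, int, int, int]:
    """Hypothetical call into a visual-grounding model.

    Should return (left, top, right, bottom) pixel coordinates of the UI
    element described by `instruction`, given a screenshot of the screen.
    """
    raise NotImplementedError("plug in your grounding model here")


def click_element(instruction: str) -> None:
    left, top, right, bottom = get_bounding_box(instruction)
    # Click the centre of the predicted box.
    x, y = (left + right) // 2, (top + bottom) // 2
    pyautogui.click(x, y)


# Example: click_element("the blue 'Submit' button in the dialog")
```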