Submitted by Empty-Revolution7570 t3_11sfj5s in MachineLearning
The newly released GPT-4 allows users to upload images, but we're still far from having a truly capable multimodal model. So we built this project as a feasibility study (and for fun!) to see how far we can get by just tuning the prompts. In short, we try to "connect" different models (vision, audio, etc.) via carefully designed prompts.
Multimedia GPT connects your OpenAI GPT with vision and audio. You can now send images, videos (in development), and even audio recordings using your OpenAI API key. We base our project on Microsoft's Visual ChatGPT, which achieves some success just by tuning the prompts.
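To illustrate the general idea (this is a minimal sketch, not our actual code), here's how a vision model's text output can be "connected" to GPT through a prompt. The `caption_image` helper and the prompt wording are hypothetical placeholders; in practice a visual foundation model (e.g., an image captioner) would fill that role:

    import openai

    openai.api_key = "YOUR_OPENAI_API_KEY"

    def caption_image(image_path: str) -> str:
        """Hypothetical stand-in for a visual foundation model (e.g., an
        image captioner) that turns an image into a text description.
        A fixed string is returned here so the sketch runs end to end;
        swap in a real vision model."""
        return "a person standing on a beach at sunset, waving at the camera"

    def chat_about_image(image_path: str, question: str) -> str:
        # "Connect" the vision model to GPT by folding its text output
        # into a carefully designed prompt.
        description = caption_image(image_path)
        prompt = (
            "The user uploaded an image. A vision model describes it as:\n"
            f'"{description}"\n'
            f"Answer the user's question about the image: {question}"
        )
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]

    print(chat_about_image("photo.jpg", "What is the person doing?"))

The same pattern extends to audio: transcribe a recording to text first, then feed the transcript into the prompt.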
Check out our project here! We also have a cool demo where Multimedia GPT successfully understands a person telling a story!
Any suggestions are appreciated~
MysteryInc152 t1_jcdthob wrote
Are you using GPT-Vision? Or is there a separate assortment of visual foundation models?