Submitted by Empty-Revolution7570 t3_11sfj5s in MachineLearning
The newly released GPT-4 allows users to upload images, but we're still far from having a truly capable multimodal model. So we built this project as a feasibility study (and for fun!) to see how far we can get by just tuning the prompts. In short, we try to "connect" different models (vision, audio, etc.) via carefully designed prompts.
Multimedia GPT connects your OpenAI GPT with vision and audio. You can now send images, videos (in development), and even audio recordings using your OpenAI API key. We base our project on Microsoft's Visual ChatGPT, which achieves some success just by tuning the prompts.
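To illustrate the general idea (this is a minimal sketch, not our actual code), here's how a vision model's text output can be "connected" to GPT through a prompt. The `caption_image` helper and the prompt wording are hypothetical placeholders; in practice a visual foundation model (e.g., an image captioner) would fill that role:

    import openai

    openai.api_key = "YOUR_OPENAI_API_KEY"

    def caption_image(image_path: str) -> str:
        """Hypothetical stand-in for a visual foundation model (e.g., an
        image captioner) that turns an image into a text description.
        A fixed string is returned here so the sketch runs end to end;
        swap in a real vision model."""
        return "a person standing on a beach at sunset, waving at the camera"

    def chat_about_image(image_path: str, question: str) -> str:
        # "Connect" the vision model to GPT by folding its text output
        # into a carefully designed prompt.
        description = caption_image(image_path)
        prompt = (
            "The user uploaded an image. A vision model describes it as:\n"
            f'"{description}"\n'
            f"Answer the user's question about the image: {question}"
        )
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]

    print(chat_about_image("photo.jpg", "What is the person doing?"))

The same pattern extends to audio: transcribe a recording to text first, then feed the transcript into the prompt.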
Check out our project here! We also have a cool demo where Multimedia GPT successfully understands a person telling a story!
Any suggestions are appreciated~
MysteryInc152 t1_jcdthob wrote
Are you using GPT-Vision? Or is there a separate assortment of visual foundation models?