astonzhang t1_j79i4jj wrote
Hi, I am an author of the paper. Opinions below are my own.
​
After we arXiv-ed our "Automatic Chain of Thought Prompting in Large Language Models" paper in Oct 2022 (here's a TLDR, ICLR'23), we were asking ourselves:
"If AGI (artificial general intelligence) is the goal, what kind of chain of thought (CoT) research do we need next? Is relying on a text-only generalist model that can perform text-only multitasks the final answer?"
"How can we connect the dots between NLP and CV communities so more researchers can contribute?"
"Since not everyone can afford playing with large models, how can we deal with input in more general form (text and images) *without* relying on larger models so a larger research community can contribute?"
​
One day I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...). My kid told me that it's much easier to understand reasoning problems with the help of figure illustrations.
"Oh, can we leverage vision input to improve chain of thought reasoning?"
"The current generalist models like GPT-3.5 (text-davinci-002/003) only offer a blackbox API (at a cost) for transforming text input into text output. Why not just fine-tune a smaller model where we have full control of all its layers (whitebox) to fuse inputs in a more general form?"
​
Fortunately, Pan Lu et al. released the ScienceQA benchmark just in time. This is a great contribution to the community, and we benefited from it by testing our idea on this benchmark early on (see the acknowledgement in our GitHub repo). Showing the promise of fine-tuning a smaller model on task-specific datasets (rather than feeding in-context learning demos to a larger generalist LLM) is exactly what we wanted from this study (you may feel more motivated after reading the T-Few paper).
If you feel motivated to try parameter-efficient fine-tuning (PEFT) ideas from the aforementioned T-Few paper to improve Multimodal-CoT, you may also wish to check out our recent PEFT design space paper at ICLR'23 (here's a TLDR).
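If you want a quick starting point for PEFT experiments, below is a rough sketch (not from our codebase) of wrapping a T5-style backbone with LoRA via the Hugging Face peft library. Note that T-Few itself is built around the (IA)³ method; LoRA is used here simply as an easy-to-try PEFT baseline, and the checkpoint name and hyperparameters are placeholders you would tune yourself.

```python
# Rough sketch, not from our repo: LoRA-style parameter-efficient fine-tuning
# of a T5 backbone with the Hugging Face `peft` library. The checkpoint name
# and hyperparameters are placeholders.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, get_peft_model, TaskType

model = T5ForConditionalGeneration.from_pretrained("t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # sequence-to-sequence rationale/answer generation
    r=8,                              # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],        # attention query/value projections in T5 blocks
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # only a small fraction of weights are trainable

# From here, train as usual on your task-specific (e.g., ScienceQA-style) data;
# only the LoRA adapter weights are updated.
```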
42gauge t1_j7e9twt wrote
> I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...
lol ..
zisyfos t1_j9j7zsk wrote
Really interesting! What are the minimum requirements to run this?
astonzhang t1_j9scuwn wrote
We ran experiments on 4 NVIDIA Tesla V100 32GB GPUs.
IluvBsissa t1_j9j9ml9 wrote
Dr. Zhang, thank you so much. Can you please tell us more about your model's performance? How would it do on standard MMLU? Can it be improved by increasing the parameter count? The paper didn't mention whether the human testers were average humans or experts.
astonzhang t1_j9sd3mw wrote
The human performance numbers were taken from the paper by Lu et al.
chinguetti t1_j9joqfu wrote
Will make a good story when you accept your Nobel prize. Well done.
ihopeshelovedme t1_j9nhhgs wrote
You think r/singularity will be kind enough to grant him a Nobel prize?
lwl t1_j8hoxpg wrote
Super interesting work, thank you for sharing! If you are still active on Reddit: we noticed that the PDF is no longer available on arXiv. Are you able to say why that is?
astonzhang t1_j8kcydh wrote
Can you check it again?
lwl t1_j8m2h7b wrote
Ah great, thanks!!
JClub t1_jabyh73 wrote
GPT was never trained with image data, so why is this a fair comparison? The UnifiedQA model is from 2022, so it doesn't seem fair either. Why don't we have comparisons with other SOTA multimodal models, such as OFA or UniT?