I think the tricky thing about actually validating zero-shot capabilities is again a question of in-sample vs. out-of-sample. Which of these samples has ChatGPT actually already seen?
Is zero-shot really the strength of GPT, and especially ChatGPT? From my (limited) experience interacting with ChatGPT, the value seems to come from prompt understanding and adaptation to my follow-up prompts / corrections. In the context of an assistant, I'm OK with priming the conversation first if it handles the subsequent requests better.
"complex reasoning is perhaps the most interesting feature of these models right now and it is unfortunately mostly absent from this survey"
Bingo. It is also the hardest to quantify; it's one of those "I know it when I see it" behaviors. It is easy to imagine how one might harness that ability to reason to solve all sorts of problems, including (but certainly not limited to) improving benchmark performance. I think that is what has a lot of people excited.
Nice work! -- How did the results compare when using ChatGPT zero-shot versus few-shot? I have noticed that with LLMs you can often get an improvement from few-shot prompting, i.e. giving the model a few examples in the prompt.
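For anyone unfamiliar, the only difference is whether the prompt carries worked examples. A minimal sketch, assuming the pre-1.0 `openai` Python client and `gpt-3.5-turbo`; the sentiment task and examples are invented for illustration, not taken from the survey:

```python
# Few-shot prompting sketch: the prompt itself carries a couple of worked
# examples. Assumes OPENAI_API_KEY is set in the environment; model name
# and task are illustrative.
import openai

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,  # keep the output stable for a classification task
)
print(response["choices"][0]["message"]["content"])  # expected: "Positive"
```

The zero-shot variant is the same call with the examples deleted, which is what makes the comparison so cheap to run.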
I am not surprised that for traditional NLP tasks we don't see much of an improvement over GPT-3. Much of OpenAI's focus seems to be not on these benchmarks but on making the results more useful to people (all the instruction-tuning / RLHF work).
Also, for real-world use, ChatGPT doesn't need to beat a fine-tuned SOTA model. Prompting ChatGPT is much easier than collecting labels and fine-tuning a more traditional model.
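To make "easier to use" concrete: the entire "pipeline" for a new task can be a single zero-shot prompt. A hedged sketch, again assuming the pre-1.0 `openai` client and a made-up task:

```python
# Zero-shot sketch: no labeled data and no training job, just a prompt.
# Assumes OPENAI_API_KEY is set in the environment; model name and task
# are illustrative, not from the thread.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as Positive or "
                   "Negative, answering with one word: "
                   "'It stopped working after a week.'",
    }],
)
print(response["choices"][0]["message"]["content"])  # e.g. "Negative"
```

Compare that with a fine-tuning workflow, where you would first need hundreds or thousands of labeled reviews before getting any predictions at all.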
Only some of the surveyed papers used few-shot prompting; it was usually beneficial, and in some cases it was enough to beat the SOTA.
Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably do even the simple tasks these benchmarks evaluate.
I agree, these baselines are useful. What I think we should push for is more human baselines for these benchmarks. That would help figure out how far we have left to go.