I think the tricky thing about actually validating zero-shot capabilities is again a question of in-sample vs. out-of-sample. Which of these samples has ChatGPT actually already seen?
Is zero-shot really the strength of GPT, and especially ChatGPT? From my (limited) experience interacting with ChatGPT, the value seems to come from prompt understanding and adaptation to my follow-up prompts / corrections. In the context of an assistant, I'm OK with priming the conversation first if it handles the subsequent requests better.
"complex reasoning is perhaps the most interesting feature of these models right now and it is unfortunately mostly absent from this survey"
Bingo. It is also the hardest to quantify; it's one of those "I know it when I see it" behaviors. It is easy to imagine how one might harness that ability to reason to solve all sorts of problems, including (but certainly not limited to) improving benchmark performance. I think that is what has a lot of people excited.
Nice work! -- How did the results compare when using ChatGPT zero-shot versus few-shot? I have noticed that with LLMs you can often get an improvement from few-shot prompting, i.e. giving the model a few examples in the prompt.
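For anyone unfamiliar, the only difference is whether the prompt carries worked examples. A minimal sketch, assuming the pre-1.0 `openai` Python client and `gpt-3.5-turbo`; the sentiment task and examples are invented for illustration, not taken from the survey:

```python
# Few-shot prompting sketch: the prompt itself carries a couple of worked
# examples. Assumes OPENAI_API_KEY is set in the environment; model name
# and task are illustrative.
import openai

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,  # keep the output stable for a classification task
)
print(response["choices"][0]["message"]["content"])  # expected: "Positive"
```

The zero-shot variant is the same call with the examples deleted, which is what makes the comparison so cheap to run.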
I am not surprised that for traditional NLP tasks we don't see much of an improvement over GPT-3. Much of OpenAI's focus seems to be not on these benchmarks but on making the results more useful to people (all the instruction-tuning / RLHF work).
Also, for real-world use, ChatGPT doesn't need to beat a fine-tuned SOTA model. Prompting ChatGPT is much easier than collecting labels and fine-tuning a more traditional model.
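To make "easier to use" concrete: the entire "pipeline" for a new task can be a single zero-shot prompt. A hedged sketch, again assuming the pre-1.0 `openai` client and a made-up task:

```python
# Zero-shot sketch: no labeled data and no training job, just a prompt.
# Assumes OPENAI_API_KEY is set in the environment; model name and task
# are illustrative, not from the thread.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as Positive or "
                   "Negative, answering with one word: "
                   "'It stopped working after a week.'",
    }],
)
print(response["choices"][0]["message"]["content"])  # e.g. "Negative"
```

Compare that with a fine-tuning workflow, where you would first need hundreds or thousands of labeled reviews before getting any predictions at all.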
Only some of the surveyed papers used few-shot prompting; it was usually beneficial, and in some cases it was enough to beat the SOTA.
Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably do even the simple tasks these benchmarks evaluate.
I agree, these baselines are useful. What I think we should push for is more human baselines for these benchmarks. That would help figure out how far we have left to go.