Abstract:

>Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

Comments

Material_Opening7336 t1_itmdpvs wrote on October 24, 2022 at 6:34 PM

#205,627

Very impressive. Thank you for sharing your paper. I will let you know if I have any questions.

[deleted] t1_itmrvzz wrote on October 24, 2022 at 8:05 PM

#207,203

[deleted]

say_wot_again t1_itn823q wrote on October 24, 2022 at 9:51 PM

#208,853

From the abstract, it seems very similar to common self-supervised techniques in computer vision. The difference is that in the case of computer vision SSL, you use the model's confident outputs on normal data to train its performance on heavily augmented data, whereas here you use the model's performance on "chain of thought" prompts to train its performance on normal prompts. But either way, the principle of "use the model's high confidence outputs on easy examples to train it on hard examples" stays the same. It's always cool to see this sort of cross-pollination between vision and NLP, though the title seems designed to conjure up images of Westworld or Ex Machina.

Edit: it appears one massive difference is that in vision, the augmentations come from the modeler, whereas here the chains of thought actually come from the model's outputs. So it's leveraging the inherent randomness in LLM outputs to generate new training data by relying on the idea that answers that frequently appear in the output are likelier to be correct. This IS pretty cool, and meaningfully different from the vision SSL case insofar as it requires much less manual intervention.
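To make the "frequent answers are likelier to be correct" idea concrete, here is a minimal sketch of majority-vote filtering over sampled reasoning paths. It is only an illustration of the principle, not the paper's code: the function name `select_high_confidence`, the `(rationale, answer)` tuple format, and the `threshold` value of 0.6 are all assumptions made up for this example.

```python
from collections import Counter


def select_high_confidence(samples, threshold=0.6):
    """Majority-vote filter over sampled (rationale, answer) pairs for one question.

    `threshold` is a hypothetical cutoff on the fraction of samples that must
    agree with the most common answer before the question is kept at all.
    Returns (majority_answer, confidence, kept_rationales).
    """
    if not samples:
        return None, 0.0, []
    counts = Counter(answer for _, answer in samples)
    majority_answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)
    if confidence < threshold:
        # Too little agreement: do not turn this question into training data.
        return majority_answer, confidence, []
    # Keep only the reasoning paths that reached the majority answer.
    kept = [rationale for rationale, answer in samples if answer == majority_answer]
    return majority_answer, confidence, kept


# Toy stand-in for several sampled chain-of-thought outputs on one question.
samples = [
    ("3 + 4 = 7, then 7 * 2 = 14", "14"),
    ("double 3 is 6, double 4 is 8, 6 + 8 = 14", "14"),
    ("(3 + 4) * 2 = 14", "14"),
    ("3 + 4 * 2 = 11", "11"),
]
answer, confidence, kept = select_high_confidence(samples)
```

The kept rationales would then serve as fine-tuning targets for the same model, which is where the "self" in self-improvement comes from. No ground-truth label appears anywhere in the filter.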

Pwhids t1_itn9glu wrote on October 24, 2022 at 10:01 PM

#209,023

They show that the large LMSI models can be distilled into smaller models while maintaining accuracy, but I wonder what size model is necessary for the LMSI training itself to be viable. They only show results for 540B. I would be very curious to see a study of whether there is a certain model size where this kicks in.

shazvaz t1_itnadll wrote on October 24, 2022 at 10:08 PM

#209,137

You want skynet? You want the singularity? This is how you get there.

Nice knowing ya folks.

[deleted] t1_itnl1x0 wrote on October 24, 2022 at 11:28 PM

#210,283

Replying to Pwhids (#209,023)

[deleted]

ReasonablyBadass t1_itolj2g wrote on October 25, 2022 at 4:07 AM

#213,690

Basic question: chain of thought prompting already generates its own prompts for the next step, right? So this also generates answers?

Lajamerr_Mittesdine OP t1_itomfs6 wrote on October 25, 2022 at 4:16 AM

#213,792

Replying to ReasonablyBadass (#213,690)

CoT simply breaks down a problem into multiple interconnected solution statements to arrive at one conclusive answer.

You can prompt a CoT model to go down different reasoning paths and arrive at different answers (some of them wrong), but those are all independent of one another.

Note that this is fine-tuning an existing LLM.

This fine-tuning is in part done by a hypermodel that helps rank solutions. These solutions are then used to fine-tune the model even further, so that it becomes a better reasoner using its own generated answers.

So the model uses its own understanding to generate CoT solution statements. The hypermodel ranks those statements, and then the existing model can be fine-tuned on the newly generated positive and negative solutions, reinforcing the idea of what correct solution statements look like and what incorrect ones look like as well.

Future work: So what is stopping the LLM from eventually getting to ~100%? The bottleneck that prevents this from going exponential is the hypermodel's ability to rank the solutions accurately. Theoretically, if you had a perfect black-box ranker, you could eventually get to ~100%. So what you would want in future work is either a more accurate ranker overall or some way to continuously improve the ranker hypermodel in an unsupervised fashion, just like we have this hypermodel for the LLM.

Personal Opinion: What this is really doing is picking some low-hanging fruit: it prompts the LLM for reasoning it already understands in different contexts, and makes that reasoning the highest-ranking solution across a broader range of prompts. It's not learning entirely new concepts.

hiptobecubic t1_itoswbc wrote on October 25, 2022 at 5:25 AM

#214,351

Didn't the Greeks try this? It's a mess until you have an epiphany and realize that you have to verify the truth of a statement before you start building on top of it.

red75prime t1_itoxv2y wrote on October 25, 2022 at 6:27 AM

#214,835

Replying to hiptobecubic (#214,351)

Greeks arguably got rules of logic out of this.

sirgarvey t1_itpo7jt wrote on October 25, 2022 at 12:09 PM

#217,185

Really appreciate the sober assessment here. Now take a peek over here for some laughs: https://reddit.com/r/singularity/comments/ybzh5j/large_language_models_can_selfimprove/

DeezNUTSampler t1_itq1l2d wrote on October 25, 2022 at 1:57 PM

#218,883

Replying to say_wot_again (#208,853)

Can you link works in Computer Vision SSL which incorporate this principle “use model’s high confidence outputs on easy examples to train it on hard examples”? It is not obvious to me how this would work. For example, in contrastive learning the objective is to learn view-invariant representations. Two views of an object, augmented differently, are pushed together in representation space by using the distance between them as the loss to minimize. Which one would constitute the easy/hard example here?
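For readers less familiar with the setup being described, the "push two views together" objective can be sketched as a distance between the two views' embeddings. This is a deliberately reduced illustration: the names `cosine_similarity` and `positive_pair_loss` are made up here, the embeddings are hard-coded toy vectors rather than encoder outputs, and practical contrastive losses generally also push apart embeddings of different images, which this sketch omits.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def positive_pair_loss(z1, z2):
    """Loss that shrinks as two views of the same object get closer.

    Only the positive-pair term is shown; negatives are left out on purpose.
    """
    return 1.0 - cosine_similarity(z1, z2)


# Toy embeddings standing in for two augmented views of one image.
aligned = positive_pair_loss([1.0, 0.0], [1.0, 0.0])
orthogonal = positive_pair_loss([1.0, 0.0], [0.0, 1.0])
```

Note that the loss is symmetric in the two views, which is exactly why neither view is obviously the "easy" or the "hard" one in this formulation.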

say_wot_again t1_itrmhsx wrote on October 25, 2022 at 8:07 PM

#225,510

Replying to DeezNUTSampler (#218,883)

Here's an example of what I had in mind. Pseudolabels for unlabeled data are generated on the clean images, but the student model is trained on a strongly augmented version of each image. It's not contrastive learning, because the objective is still explicitly object detection; instead, easy vs. hard is the original image vs. the strongly augmented one.
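A rough sketch of that pseudo-labeling loop, assuming a simplified classification-style setup rather than real object detection. Everything named here is hypothetical: `build_pseudo_labeled_batch`, the `teacher_predict` and `strong_augment` callables, and the 0.9 confidence threshold are stand-ins for illustration, not any particular library's API.

```python
def build_pseudo_labeled_batch(images, teacher_predict, strong_augment, threshold=0.9):
    """Pair confident teacher labels on clean inputs with augmented inputs.

    teacher_predict(image) -> (label, confidence) is evaluated on the CLEAN
    image; the student would later be trained on strong_augment(image) with
    that label. Low-confidence predictions are dropped.
    """
    batch = []
    for image in images:
        label, confidence = teacher_predict(image)
        if confidence >= threshold:
            batch.append((strong_augment(image), label))
    return batch


# Toy stand-ins: an "image" is a short list of pixel intensities in [0, 1].
def toy_teacher(image):
    mean = sum(image) / len(image)
    return ("bright", mean) if mean > 0.5 else ("dark", 1.0 - mean)


def toy_augment(image):
    # Placeholder for a strong augmentation; here it just reverses the pixels.
    return list(reversed(image))


images = [[0.95, 0.97], [0.5, 0.6], [0.02, 0.04]]
batch = build_pseudo_labeled_batch(images, toy_teacher, toy_augment)
```

The middle image is discarded because the teacher is not confident about it, so the student only ever sees hard (augmented) inputs paired with labels that were produced on easy (clean) ones.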