Paper: https://arxiv.org/abs/2301.13379

Abstract:

>While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a faithful-by-construction framework that decomposes a reasoning task into two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. We demonstrate the efficacy of our approach on 10 reasoning datasets from 4 diverse domains. It outperforms traditional CoT prompting on 9 out of the 10 datasets, with an average accuracy gain of 4.4 on Math Word Problems, 1.9 on Planning, 4.0 on Multi-hop Question Answering (QA), and 18.1 on Logical Inference, under greedy decoding. Together with self-consistency decoding, we achieve new state-of-the-art few-shot performance on 7 out of the 10 datasets, showing a strong synergy between faithfulness and accuracy.
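
The two-stage split described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `translate` is a hypothetical stand-in that returns a hard-coded reasoning chain instead of calling an LM, the chain format (one `name = expression` assignment per line) and the example word problem are assumptions, and `solve` is a toy deterministic solver limited to basic arithmetic. `majority_vote` shows the general idea behind self-consistency decoding: pick the most frequent answer across several sampled chains.

```python
import ast
import operator
from collections import Counter

# Binary operators the toy solver is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def translate(query: str) -> list[str]:
    """Stage 1 (Translation): query -> symbolic reasoning chain.

    Hypothetical stand-in for an LM call; the chain is hard-coded here.
    """
    return [
        "apples_start = 23",
        "apples_used = 20",
        "apples_bought = 6",
        "answer = apples_start - apples_used + apples_bought",
    ]


def _eval(node, env):
    """Evaluate a restricted arithmetic expression against known variables."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def solve(chain: list[str]):
    """Stage 2 (Problem Solving): run the chain deterministically."""
    env = {}
    for line in chain:
        name, expr = line.split("=", 1)
        env[name.strip()] = _eval(ast.parse(expr.strip(), mode="eval").body, env)
    # The answer is computed from the chain itself, so the chain is by
    # construction the explanation of how the answer was reached.
    return env["answer"]


def majority_vote(answers):
    """Self-consistency: return the most common answer among sampled chains."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    query = "There were 23 apples. 20 were used and 6 more were bought. How many now?"
    print(solve(translate(query)))
```

Because the solver only ever sees the chain, a wrong chain yields a wrong answer rather than a right answer with an unrelated explanation, which is the sense of "faithful by construction" in this sketch.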

Comments

Alternative-Prize612 t1_j6q4zal wrote on February 1, 2023 at 2:30 AM

#1,694,083

Amazing, thanks for posting.

RandomCandor t1_j6qr7t0 wrote on February 1, 2023 at 5:37 AM

#1,694,345

Man, I feel like we're living through the beginning of an AI arms race.

What a time to be alive! (Like one of my favorite YouTubers would say.)

Acceptable-Cress-374 t1_j6qutn3 wrote on February 1, 2023 at 6:17 AM

#1,694,388

Replying to RandomCandor (#1,694,345)

Hold on to your papers!

IsABot-Ban t1_j6qxhi3 wrote on February 1, 2023 at 6:48 AM

#1,694,438

Replying to Acceptable-Cress-374 (#1,694,388)

Remember we look at where it will be a few papers down the road.

mlresearchoor t1_j6r8x7y wrote on February 1, 2023 at 9:20 AM

#1,694,618

Nice find! It would also be helpful to compare with similar 2022 papers that this paper cites but does not compare against in its results section:

("We note that our work is concurrent with Chen et al. (2022) and Gao et al. (2022), both generating the reasoning chain in Python code and calling a Python interpreter to derive the answer. While we do not compare with them empirically since they are not yet published...")

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks (Chen)
https://arxiv.org/abs/2211.12588

PAL: Program-aided Language Models (Gao)
https://arxiv.org/abs/2211.10435
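
For readers unfamiliar with the approach the quote describes (the LM writes the reasoning chain as Python code, and a Python interpreter derives the answer), here is a minimal sketch. The program text is hard-coded rather than produced by a model, and the `solution()` entry-point convention, the `run_generated_program` helper, and the toy problem are assumptions for illustration, not code taken from either paper.

```python
# Hypothetical LM output: in program-aided prompting the model writes this
# text; it is hard-coded here so the example runs without any model.
GENERATED_PROGRAM = '''
def solution():
    toys_initial = 5
    toys_from_mom = 2
    toys_from_dad = 2
    return toys_initial + toys_from_mom + toys_from_dad
'''


def run_generated_program(source: str):
    """Execute generated code and return the value of its solution() function.

    NOTE: exec on untrusted model output is unsafe; a real system would
    need sandboxing. This helper is an illustrative assumption.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["solution"]()


if __name__ == "__main__":
    print(run_generated_program(GENERATED_PROGRAM))
```

The arithmetic is done by the interpreter, not the LM, so the model only has to get the translation into code right.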

mlresearchoor t1_j6r93hm wrote on February 1, 2023 at 9:22 AM

#1,694,619

Replying to RandomCandor (#1,694,345)

we got front-row seats to this race and a chance to participate, +1 great time to be alive

codename_failure t1_j6riypw wrote on February 1, 2023 at 11:35 AM

#1,694,783

Replying to IsABot-Ban (#1,694,438)

Well done, little AI!

Infinite-Recover-712 t1_j6s8uaj wrote on February 1, 2023 at 3:13 PM

#1,695,540

questionnaire to Reddit https://docs.google.com/forms/d/e/1FAIpQLSekbvorz3RUhlDd1JqC2P8n6E9dTU8CWdugEJMcyd-pqUpPDw/viewform?usp=sf_link

Nhabls t1_j6sbq3i wrote on February 1, 2023 at 3:31 PM

#1,695,651

Replying to RandomCandor (#1,694,345)

The arms race has been going for over a decade now...

LetterRip t1_j6shnin wrote on February 1, 2023 at 4:09 PM

#1,695,879

Replying to mlresearchoor (#1,694,618)

The prompts in those two papers are so specific to the datasets that they don't seem very useful. We'll have to wait for the code to see whether FCoT is a similar case.

throwaway2676 t1_j6syciq wrote on February 1, 2023 at 5:53 PM

#1,696,513

Woah, hey, this is basically what I proposed last month

oscineStyron415 t1_j6titmk wrote on February 1, 2023 at 7:58 PM

#1,697,339

Was a good read. Lots of big movement these past few months

axm92 t1_j6tw995 wrote on February 1, 2023 at 9:20 PM

#1,697,864

Replying to LetterRip (#1,695,879)

Thanks! Can you please clarify what you mean by the prompts being specific to the datasets for PaL?

As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything). The prompts and code are also open-sourced at https://reasonwithpal.com/ if you want to check them out!

Incidentally, the idea that Python programs lead to faithful reasoning chains was used in PaL to create a new split of GSM, called GSM-hard. GSM-hard is available on Hugging Face.

(I'm a co-author of the PaL paper.)

LetterRip t1_j6u7cu9 wrote on February 1, 2023 at 10:31 PM

#1,698,314

Replying to axm92 (#1,697,864)

In my view, a prompt like "Let's think things through step by step" is extremely generic and requires no knowledge specific to the upcoming questions.

I was basing my comment mostly on the contents of this folder:

https://github.com/reasoning-machines/pal/tree/main/pal/prompt

Each of the prompts seems to have required extensive knowledge of the test set to formulate.

This seems more akin to Watson, where the computer scientists analyzed the form of a variety of questions and wrote programs for each type of question.

axm92 t1_j6uf2a7 wrote on February 1, 2023 at 11:23 PM

#1,698,692

Replying to LetterRip (#1,698,314)

Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

Notably, take a look at the section on GSM-hard (4.1). You may also enjoy the analysis in the new version of the paper (Section 6: https://arxiv.org/pdf/2211.10435.pdf).

Further, "Let's think step by step" is outperformed by "Write Python code to solve this." We'll add the numbers in the next version, but if you are interested please lmk and I can share the results earlier.

Thanks again for reading our work and sharing your feedback, I really appreciate it.

LetterRip t1_j6uj087 wrote on February 1, 2023 at 11:50 PM

#1,698,903

Replying to axm92 (#1,698,692)

> Further, "Let's think step by step" is outperformed by "Write Python code to solve this."

Interesting, I was just wondering while reading that paper how well that would work compared to the n-shot prompts.

> Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

>> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

That's fair. My thoughts were mostly directed at the "Table 2: Solve rate on three symbolic reasoning datasets and two algorithmic datasets" items. I think you could be right that my comments don't apply to the results in Figure 5 (GSM8K, GSM-HARD, SVAMP, ASDIV, SINGLEEQ, SINGLEOP, ADDSUB, MULTIARITH).

I'd be curious how well the "Write Python code to solve this" prompt performs on its own versus the "Let's think things through step by step" prompt.

Hyper1on t1_j6xm8m9 wrote on February 2, 2023 at 4:48 PM

#1,704,758

This is a fine approach, but it's not necessarily chain of thought if you move the actual problem solving outside of the LM. The entire point of Chain of Thought as originally conceived is that it's a better way of doing within-model problem solving. I would be interested to see the result if you were to finetune the LM on a dataset of reasoning from this approach, however.