Submitted by Cool_Abbreviations_9 t3_123b66w in MachineLearning
MysteryInc152 t1_jdvqj47 wrote
Reply to comment by was_der_Fall_ist in [D] GPT-4 might be able to tell you if it hallucinated by Cool_Abbreviations_9
That's not what I meant with regard to calibration. It's not about giving an answer x% of the time or not; it's about being able to correctly estimate gaps in knowledge.
Good calibration is what you want.
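To make "good calibration" concrete, here is a minimal sketch of expected calibration error (ECE): bin answers by stated confidence and compare each bin's average confidence to its observed accuracy. The function name, bin count, and toy numbers are just for illustration, not from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    average confidence to its observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: the 0.7-confidence answers are only right half the time,
# so the ECE is nonzero; a well-calibrated model would drive it toward 0.
print(expected_calibration_error([0.9, 0.7, 0.7, 0.3], [1, 1, 0, 0]))
```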
was_der_Fall_ist t1_jdw2ya2 wrote
Check out the comments in this LessWrong thread.
Paul Christiano, an alignment researcher at ARC (previously at OpenAI), explains the RLHF change exactly the way I did (because I was pretty much quoting him), and someone replies:
> Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
And Paul replies:
> Yes, I think you are misunderstanding figure 8. I don't have inside information, but without explanation "calibration" would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedback, under a careful training regime it would probably get modestly better.)
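For intuition, here is a rough sketch of what "reading calibration off the logits" looks like for one multiple-choice question: restrict the next-token distribution to the answer letters and take the softmax over just those tokens. The model choice (gpt2 as a stand-in) and prompt format are assumptions for illustration, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for the illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Mars\nD. Earth\n"
    "Answer:"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Probability mass the model assigns to each answer letter.
option_ids = [tok.encode(f" {c}", add_special_tokens=False)[0] for c in "ABCD"]
probs = torch.softmax(logits[option_ids], dim=0)
for letter, p in zip("ABCD", probs):
    print(letter, round(p.item(), 3))
```

Calibration in this sense is then measured across many questions: answers assigned ~0.8 probability should be correct about 80% of the time. That is a different measurement from asking the model to verbalize its uncertainty in text.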
meister2983 t1_jdwt675 wrote
Also, this is for multiple-choice questions (MMLU). I don't think they reported whether the pre-RLHF model's confidence numbers on fill-in-the-blank world facts aligned with reality.