Submitted by JClub t3_10emf7a in MachineLearning
buzzbuzzimafuzz t1_j4u5jrz wrote
I think what OpenAI and Anthropic typically do is providing evaluators with two possible responses and having them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper "Deep reinforcement learning from human feedback" (2017):
>We ask the human to compare short video clips of the agent’s behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.
ChatGPT seems to be trained from a combination of expert-written examples and upvotes and downvotes on individual messages.
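To make the comparison setup from that paper concrete, here's a rough sketch of the preference model it fits (PyTorch; the function and tensor names are mine, not from the paper): each pair of clips gets a probability that the first is preferred, computed from the summed predicted rewards, and the reward predictor is trained with cross-entropy against the human's choice.

    import torch

    def preference_prob(rewards_a: torch.Tensor, rewards_b: torch.Tensor) -> torch.Tensor:
        # rewards_a / rewards_b: per-step predicted rewards for two clips, each of shape (T,).
        # P(A preferred over B) = exp(sum r_A) / (exp(sum r_A) + exp(sum r_B)),
        # which is the same as sigmoid(sum r_A - sum r_B).
        return torch.sigmoid(rewards_a.sum() - rewards_b.sum())

    def comparison_loss(rewards_a: torch.Tensor, rewards_b: torch.Tensor, label: float) -> torch.Tensor:
        # Cross-entropy against the human label (1.0 = clip A preferred, 0.0 = clip B preferred).
        p_a = preference_prob(rewards_a, rewards_b)
        return -(label * torch.log(p_a) + (1.0 - label) * torch.log(1.0 - p_a))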
JClub OP t1_j4uc8lc wrote
Yes, that makes sense! But can you really combine thumbs-up/down feedback with a 1-5 scale? Wouldn't it be even harder to make the two work together when training the model?
koolaidman123 t1_j4uuko0 wrote
ChatGPT (assuming it uses the same training as InstructGPT) doesn't use a numerical scale. Everything is a comparison between 2 (out of k) sampled outputs for a prompt, so it all comes down to pairwise comparisons.
JClub OP t1_j4v057p wrote
Yeah, InstructGPT works like that. But how do you calculate a reward score for each output in this ranking scenario?
koolaidman123 t1_j4v2uyq wrote
It's just a binary pairwise comparison of which of the 2 outputs is preferred. Read the InstructGPT paper or the wandb post: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model
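Concretely, the reward-model loss in the InstructGPT paper is -log(sigmoid(r(x, y_chosen) - r(x, y_rejected))) over the compared pairs. Rough sketch (PyTorch; `reward_model` is a placeholder for whatever maps a prompt + response to a single scalar):

    import torch
    import torch.nn.functional as F

    def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
        # InstructGPT-style loss: push the scalar reward of the human-preferred
        # output above the rejected one.
        # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Hypothetical usage, with reward_model returning one scalar per (prompt, response):
    # chosen_rewards   = reward_model(prompts, preferred_responses)   # shape (B,)
    # rejected_rewards = reward_model(prompts, rejected_responses)    # shape (B,)
    # loss = pairwise_reward_loss(chosen_rewards, rejected_rewards)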
JClub OP t1_j4v5d0y wrote
Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward is between 0 and 1!
Do you think the sigmoid is needed?
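Something like this is what I mean (just a sketch of the two options, not from any paper):

    import torch

    def reward_for_rl(score: torch.Tensor, squash: bool = False) -> torch.Tensor:
        # score: scalar output of the trained reward model for one (prompt, response).
        if squash:
            return torch.sigmoid(score)  # bounded in (0, 1)
        return score  # use the raw reward directly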