Submitted by JClub t3_10emf7a in MachineLearning
buzzbuzzimafuzz t1_j4u5jrz wrote
I think what OpenAI and Anthropic typically do is providing evaluators with two possible responses and having them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper "Deep reinforcement learning from human feedback" (2017):
>We ask the human to compare short video clips of the agent’s behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.
ChatGPT seems to be trained from a combination of expert-written examples and upvotes and downvotes on individual messages.
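To make the comparison setup from that paper concrete, here's a rough sketch of the preference model it fits (PyTorch; the function and tensor names are mine, not from the paper): each pair of clips gets a probability that the first is preferred, computed from the summed predicted rewards, and the reward predictor is trained with cross-entropy against the human's choice.

    import torch

    def preference_prob(rewards_a: torch.Tensor, rewards_b: torch.Tensor) -> torch.Tensor:
        # rewards_a / rewards_b: per-step predicted rewards for two clips, each of shape (T,).
        # P(A preferred over B) = exp(sum r_A) / (exp(sum r_A) + exp(sum r_B)),
        # which is the same as sigmoid(sum r_A - sum r_B).
        return torch.sigmoid(rewards_a.sum() - rewards_b.sum())

    def comparison_loss(rewards_a: torch.Tensor, rewards_b: torch.Tensor, label: float) -> torch.Tensor:
        # Cross-entropy against the human label (1.0 = clip A preferred, 0.0 = clip B preferred).
        p_a = preference_prob(rewards_a, rewards_b)
        return -(label * torch.log(p_a) + (1.0 - label) * torch.log(1.0 - p_a))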
JClub OP t1_j4uc8lc wrote
Yes, that makes sense! But can you really combine thumbs-up/down feedback with a 1-5 scale? Wouldn't it be even harder to make the two work together when training the model?
koolaidman123 t1_j4uuko0 wrote
ChatGPT (assuming it uses the same training as InstructGPT) doesn't use a numerical scale. Everything is a comparison between 2 (out of k) sampled outputs for a prompt, so it all comes down to pairwise comparisons.
JClub OP t1_j4v057p wrote
Yeah, InstructGPT works like that. But how do you calculate a reward score for each output in this ranking scenario?
koolaidman123 t1_j4v2uyq wrote
It's just a binary pairwise comparison of which of the 2 outputs is preferred. Read the InstructGPT paper or the wandb post: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model
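Concretely, the reward-model loss in the InstructGPT paper is -log(sigmoid(r(x, y_chosen) - r(x, y_rejected))) over the compared pairs. Rough sketch (PyTorch; `reward_model` is a placeholder for whatever maps a prompt + response to a single scalar):

    import torch
    import torch.nn.functional as F

    def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
        # InstructGPT-style loss: push the scalar reward of the human-preferred
        # output above the rejected one.
        # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Hypothetical usage, with reward_model returning one scalar per (prompt, response):
    # chosen_rewards   = reward_model(prompts, preferred_responses)   # shape (B,)
    # rejected_rewards = reward_model(prompts, rejected_responses)    # shape (B,)
    # loss = pairwise_reward_loss(chosen_rewards, rejected_rewards)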
JClub OP t1_j4v5d0y wrote
Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward is between 0 and 1!
Do you think the sigmoid is needed?
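Something like this is what I mean (just a sketch of the two options, not from any paper):

    import torch

    def reward_for_rl(score: torch.Tensor, squash: bool = False) -> torch.Tensor:
        # score: scalar output of the trained reward model for one (prompt, response).
        if squash:
            return torch.sigmoid(score)  # bounded in (0, 1)
        return score  # use the raw reward directly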