Submitted by JClub t3_10emf7a in MachineLearning
Hey everyone, I just saw Nathan Lambert's great presentation on Reinforcement Learning from Human Feedback and wanted to try some RLHF on my language model. To do this, I first need to set up an experiment where I collect reward scores to train the reward model.
My question is: what rewards work best? A simple 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose.
buzzbuzzimafuzz t1_j4u5jrz wrote
I think what OpenAI and Anthropic typically do is provide evaluators with two possible responses and have them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper "Deep Reinforcement Learning from Human Preferences" (2017):
> We ask the human to compare short video clips of the agent's behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.
ChatGPT seems to be trained on a combination of expert-written examples and upvotes and downvotes on individual messages.