Submitted by JClub t3_10emf7a in MachineLearning
Hey everyone, I just saw the great presentation by Nathan Lambert on Reinforcement Learning from Human Feedback and wanted to try some RLHF on my language model. To do this, I first need to set up an experiment where I collect reward scores to train the reward model.
My question is: what rewards work best? Simply 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose.
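For context, my rough understanding of the ranking option is that the K ranked outputs get turned into pairwise comparisons and the reward model is trained with a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)), like in the InstructGPT paper. Here's a quick untested sketch of what I mean (function and variable names are just placeholders I made up):

    import itertools
    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
        """rewards_ranked: shape (K,), reward-model scores for K outputs of the
        same prompt, ordered from best (index 0) to worst (index K-1) by the labeler."""
        losses = []
        for better, worse in itertools.combinations(range(rewards_ranked.shape[0]), 2):
            # a lower index means the labeler ranked that output higher
            losses.append(-F.logsigmoid(rewards_ranked[better] - rewards_ranked[worse]))
        return torch.stack(losses).mean()

    # Example: reward-model scores for 4 ranked outputs of one prompt
    scores = torch.tensor([1.3, 0.9, 0.2, -0.5], requires_grad=True)
    loss = pairwise_ranking_loss(scores)
    loss.backward()

So ranking gives you K*(K-1)/2 comparisons per prompt, whereas 👍/👎 or a 1-5 scale gives you a single (noisier?) scalar per output. Not sure which actually works better in practice, hence the question.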