Submitted by JClub t3_10emf7a in MachineLearning

Hey everyone, just saw Nathan Lambert's great presentation on Reinforcement Learning from Human Feedback and wanted to try some RLHF on my language model. To do this, I first need to set up an annotation step where I collect reward scores to train the reward model.

My question is: what rewards work best? Simply 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose.

19

Comments


buzzbuzzimafuzz t1_j4u5jrz wrote

I think what OpenAI and Anthropic typically do is provide evaluators with two possible responses and have them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper "Deep Reinforcement Learning from Human Preferences" (2017):

> We ask the human to compare short video clips of the agent's behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful

ChatGPT seems to be trained from a combination of expert-written examples and upvotes and downvotes on individual messages.
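To make the comparison setup concrete, a single labeled record can be as simple as this (a minimal sketch; the field names here are made up rather than any particular library's schema):

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One human judgment: which of two sampled responses is better."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by the evaluator

example = Comparison(
    prompt="Summarize RLHF in one sentence.",
    response_a="RLHF fine-tunes a model against a reward model learned from human preferences.",
    response_b="RLHF is a kind of database.",
    preferred="a",
)
```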

9

JClub OP t1_j4uc8lc wrote

Yes, that makes sense! But can you really combine a thumbs-up/down signal with a 1-5 scale? Wouldn't that make it even harder to get both to work together when training the model?

1

koolaidman123 t1_j4uuko0 wrote

ChatGPT (assuming they use the same training as InstructGPT) doesn't use a numerical scale; everything is a comparison between 2 (out of K) sampled outputs for a prompt, so it all comes down to pairwise comparisons.
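For example, if labelers rank K sampled outputs for a prompt, that single ranking expands into K·(K−1)/2 pairwise comparisons; a quick sketch (assuming `ranked` is ordered best to worst):

```python
from itertools import combinations

def ranking_to_pairs(ranked: list[str]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking into (chosen, rejected) pairs."""
    return list(combinations(ranked, 2))

# 4 ranked outputs for one prompt -> 6 pairwise comparisons
pairs = ranking_to_pairs(["best", "second", "third", "worst"])
```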

1

JClub OP t1_j4v057p wrote

Yeah, InstructGPT is like that. But how do you calculate a reward score for each output in this ranking scenario?

1

koolaidman123 t1_j4v2uyq wrote

It's just a binary pairwise comparison of which of the 2 outputs is preferred. Read the InstructGPT paper or the wandb post: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model
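Concretely, the reward model only has to score the preferred output higher than the rejected one. A minimal sketch of that pairwise loss (the -log σ(r_chosen − r_rejected) formulation described in the InstructGPT paper and the trlX post; the reward model itself is assumed to be any network producing one scalar per response):

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: push the chosen reward above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy scalar rewards for a batch of 3 comparisons
loss = preference_loss(torch.tensor([1.2, 0.3, -0.5]),
                       torch.tensor([0.4, -0.1, 0.2]))
```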

2

JClub OP t1_j4v5d0y wrote

Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward is between 0 and 1!

Do you think that the sigmoid is needed?

2

velcher t1_j4ts9n0 wrote

Disclaimer: I'm a deep RL person, so I'm speaking from a pure RL viewpoint. I have never trained an LLM with RLHF (yet ;)).

You can think of rewards as a way of expressing preferences to the model. Then you can reason about what types of rewards to use.

- Binary: either the output is good or bad. There is no preference between outputs that are good (they are all 1) or outputs that are bad (they are all 0).
- Scale of 1-5: there are 5 preferences of increasing order. In particular, the rank 1 choice is exactly 1 real value (see the aside for what the real value does) more than rank 2.
- Ranking 4 different model outputs: Not sure what you mean here.

Aside: reward scale can affect the RL process. RL policies are commonly trained through something called the "policy gradient", which weights the policy update by the scale of the return (sum of rewards). So the larger your reward scaling, the larger this gradient. Rewards that are too large can make the gradient too large and lead to an unstable policy; rewards that are too small result in small gradients and therefore slow-to-converge policies. This reward scale can be counteracted by the learning rate or by reward normalization, but all of this needs to be tuned for the specific task.

Reward scaling can also affect your RL algorithm, particularly if it uses an entropy penalty for exploration (SAC, TD3, PPO, TRPO etc.).
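To illustrate the normalization point, one common trick is to standardize rewards (or advantages) within each batch before the policy-gradient update; a minimal sketch:

```python
import torch

def normalize_rewards(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize a batch of rewards so their raw scale doesn't blow up
    (or shrink) the policy gradient."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

raw = torch.tensor([5.0, 50.0, 500.0])   # wildly different raw scales
print(normalize_rewards(raw))            # roughly zero mean, unit std
```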

5

JClub OP t1_j4uc0bg wrote

PPO's clipped objective already keeps the gradient updates smaller than in other RL algorithms. I get that the reward measures the human's preference, but that doesn't answer my question 🤔: what rewards work best for PPO?

1

JacksOngoingPresence t1_j4v9jh4 wrote

There isn't much difference between "simply 👍/👎" and a "scale of 1-5". They will probably give ~the same results. I understand the first one as {0, 1} and the second as {0, ..., 1}. It's just a question of resolution. The 1-5 version will most likely give you faster convergence, but it can also f you up if some of your data gets mislabeled, since it's easier to make mistakes with high resolution.

But in the limit, if you take 1 million different people and ask them to assess your model in a binary fashion, or on a scale of 1 to 10, and then average the results, you will get the same thing. It's just that from a human perspective, it's easier to assess things as yes/no (e.g., "did you like this new movie?" vs. "how would you rate this movie on a scale from 1 to 10?"). But from the computer's perspective, ML wants that label to be as close to its true value as possible.
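That "in the limit" claim is easy to sanity-check with a toy simulation (a sketch only: each model variant has a hidden quality, raters give either a noisy yes/no vote or a noisy 1-10 score, and both averaged signals rank the variants the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
n_raters = 1_000_000

for quality in [0.2, 0.5, 0.8]:                # hidden quality of 3 model variants
    yes_no = rng.random(n_raters) < quality     # binary 👍/👎 votes
    ten_scale = np.clip(np.round(quality * 10 + rng.normal(0, 2, n_raters)), 1, 10)
    print(f"quality={quality:.1f}  mean up-vote rate={yes_no.mean():.3f}  "
          f"mean 1-10 score={ten_scale.mean():.2f}")
# both averages increase monotonically with the hidden quality,
# so in the large-sample limit they induce the same ranking of the variants
```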

2