
dataslacker t1_j4xd5aj wrote

That’s a nice explanation, but I’m still unclear on the motivation for RL. You say the reward isn’t differentiable, but since it’s just a label that tells us which of the outputs is best, why not simply use that output for supervised training?

7

JClub OP t1_j4xgp2x wrote

You're not the first person who's asked me that question! I need to add a more detailed explanation for that :)

The reward is non-differentiable because it is produced by a reward model, and this reward model takes text as input. That text is obtained by decoding the log probabilities output by your model. This decoding step (sampling or argmax over the vocabulary) is non-differentiable, so we lose the gradient link between the LM and the reward model.
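A minimal PyTorch sketch of where the gradient link breaks (toy shapes, assumed setup, not the actual model):

```python
import torch

# Toy stand-in for a language model head: logits over a 5-token vocabulary.
logits = torch.randn(1, 5, requires_grad=True)

# Decoding (here argmax; sampling behaves the same way) produces discrete
# integer token ids with no grad_fn attached.
token_ids = logits.argmax(dim=-1)

# Anything computed from token_ids (e.g. a reward model run on the decoded
# text) cannot backpropagate to the logits.
print(token_ids.requires_grad)  # False: the gradient link is broken
```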

Does this make sense? Also, if the reward is given directly by a human, instead of a reward model, it's clearer that this reward is non-differentiable.

RL helps transform this non-differentiable reward into a differentiable loss :)

5

dataslacker t1_j4yraoc wrote

Sorry, I think I didn’t do a great job asking the question. The reward model, as I understand it, ranks the N generated responses from the LLM. So why not take the top-ranked response as ground truth (or a weak label, if you’d like) and train in a supervised fashion, predicting the next token? That would avoid the RL training, which I understand is inefficient and unstable.

2

JClub OP t1_j4z57kr wrote

Yes, the reward model can rank model outputs, but it does so by giving a score to each output. You want to train with this score, not with the "pseudo-labeling" you're describing. The reward score itself is non-differentiable, and RL helps construct a differentiable loss. Does that make sense?
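A REINFORCE-style sketch of how this works (assumed toy setup; RLHF in practice uses PPO, but the score-function trick is the same idea): the sampling step stays non-differentiable, and the scalar reward simply scales the log-probability of the sampled output.

```python
import torch

# Toy policy: logits over a 5-token vocabulary.
logits = torch.randn(1, 5, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)

action = dist.sample()                  # non-differentiable sampling
reward = 0.7                            # scalar score from the reward model
loss = -reward * dist.log_prob(action)  # differentiable surrogate loss
loss.sum().backward()

print(logits.grad is not None)  # True: gradients now reach the policy
```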

1

dataslacker t1_j4z8zm4 wrote

Yes, your explanations are clear and match how I understood the paper, but I feel like some motivation for the RL training is missing. Why not "pseudo-labeling"? Why is the RL approach better? Also, the reward score is non-differentiable because it was designed that way; they could have designed it to be differentiable. For example, instead of decoding the log probs, why not train the reward model on them directly? You can still obtain the labels via decoding, but that doesn't mean the decoded text has to be the input to the reward model. There are a number of design choices the authors made that aren't motivated in the paper. I haven't read the references, so maybe they're motivated elsewhere in the literature, but RL seems like a strange choice for this problem since there isn't a dynamic environment the agent is interacting with.

3

JClub OP t1_j4zejga wrote

Yes, 100% agree with you. I believe the researchers have also tried pseudo-labeling or making the reward differentiable, as you say, and maybe RL is the SOTA approach now. But these are just guesses!

1

mtocrat t1_j4zecpm wrote

What you're describing is a general approach to RL that is used in different forms in many methods: sample actions, weight or rank them in some way by the estimated return, and regress to the weighted actions. So you're not suggesting doing something other than RL, but rather replacing one RL approach with a different one.

2

crazymonezyy t1_j4yjtuz wrote

Amongst other things, RL's major benefit is learning from a sequence of rewards rather than simply "a reward", which is the assumption when you treat this as a SL problem. Do remember that IID observations are one of the fundamental premises of SL.

1