dataslacker t1_j4yraoc wrote

Sorry, I think I didn't do a great job asking the question. The reward model, as I understand it, will rank the N generated responses from the LLM. So why not take the top-ranked response as ground truth (or a weak label, if you'd like) and train in a supervised fashion, predicting the next token? This would avoid the RL training, which I understand is inefficient and unstable.
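To make the proposal concrete, here is a minimal sketch of the "take the top-ranked response as a weak label" idea. The names `build_pseudo_labels`, `toy_generate`, and `toy_score` are hypothetical stand-ins for illustration; in practice the generator would be the LLM and the scorer the trained reward model, and the resulting pairs would feed an ordinary next-token cross-entropy fine-tuning run.

```python
def build_pseudo_labels(prompts, generate, score, n=3):
    """For each prompt, sample n responses and keep the top-scored one as a weak label."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n)]
        # The reward model is only used to pick a winner; its score is then discarded.
        best = max(candidates, key=lambda response: score(prompt, response))
        pairs.append((prompt, best))
    return pairs


def toy_generate(prompt, sample_index):
    # Hypothetical stand-in for sampling from the LLM.
    return prompt + " ok" * (sample_index + 1)


def toy_score(prompt, response):
    # Hypothetical stand-in for the reward model: longer responses score higher.
    return len(response)


pairs = build_pseudo_labels(["a", "b"], toy_generate, toy_score, n=3)
```

Note that this keeps only the identity of the best response and throws away how much better it was than the others.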

2

JClub OP t1_j4z57kr wrote

Yes, the reward model can rank model outputs but it does that by giving a score to each output. You want to train with this score, not with "pseudo labeling" as you are stating. But the reward score is non-differentiable, and RL helps to construct a differentiable loss. Does that make sense?
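As a rough illustration of that last point, here is a toy REINFORCE-style sketch (not the PPO setup used in the paper). The reward comes from a lookup table, so no gradient can flow through it, yet the policy still gets a usable gradient because the score only multiplies the gradient of the log-probability of the sampled response. Everything here (`reward_model`, the four candidate "responses", the learning rate, the running baseline) is an assumption made for the example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(response_index):
    # Hypothetical black-box scorer: a table lookup, so it has no gradient.
    scores = [0.1, 0.9, 0.3, 0.5]
    return scores[response_index]


def reinforce(num_steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * 4   # policy parameters over 4 candidate responses
    baseline = 0.0       # running average reward, used to reduce variance
    for _ in range(num_steps):
        probs = softmax(logits)
        action = rng.choices(range(4), weights=probs)[0]
        advantage = reward_model(action) - baseline
        baseline += 0.05 * (reward_model(action) - baseline)
        # Gradient of log pi(action) w.r.t. the logits is (one_hot - probs);
        # the score enters only as a scalar weight on that gradient.
        for k in range(4):
            grad_logp = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_logp
    return softmax(logits)


final_probs = reinforce()
```

In this toy setting the policy should shift most of its probability toward the response the scorer rates highest, using the magnitude of the score rather than just the ranking.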

1

dataslacker t1_j4z8zm4 wrote

Yes, your explanations are clear and are also how I understood the paper, but I feel like there's some motivation for the RL training that's missing. Why not "pseudo labeling"? Why is the RL approach better? Also, the reward score is non-differentiable because it was designed that way, but they could have designed it to be differentiable. For example, instead of decoding the log probs, why not train the reward model on them directly? You can still obtain the labels by decoding them; that doesn't mean the decoded text has to be the input to the reward model. There are a number of design choices the authors made that are not motivated in the paper. I haven't read the reference, so maybe they are motivated elsewhere in the literature, but RL seems like a strange choice for this problem since there isn't a dynamic environment that the agent is interacting with.

3

JClub OP t1_j4zejga wrote

Yes, 100% agree with you. I believe the researchers have also tried pseudo labeling and making the reward differentiable, as you suggest, and maybe RL is simply the SOTA approach for now. But these are just guesses!

1