Submitted by JClub t3_10fh79i in MachineLearning
JClub OP t1_j4zejga wrote
Reply to comment by dataslacker in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub
Yes, 100% agree with you. I believe the researchers have also tried pseudo-labeling or making the reward differentiable, as you say, and RL may simply be the SOTA approach right now. But these are just guesses!