
_Arsenie_Boca_ t1_j6z24n6 wrote

Since it wasn't mentioned so far: RL does not require the loss/reward to be differentiable. This enables us to learn from complete generated sentences (LM sampling is not differentiable) rather than just at the token level. See the sketch below.
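To make that concrete, here is a minimal REINFORCE-style sketch (not the actual RLHF/PPO pipeline discussed in the paper): the reward is computed on the decoded text of a full sampled sentence and never needs a gradient; only the log-probabilities of the sampled tokens are differentiated. The GPT-2 model and the toy keyword reward are placeholders for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

prompt = tokenizer("The movie was", return_tensors="pt").input_ids
prompt_len = prompt.shape[1]

# Sample a complete continuation; the sampling step itself is not differentiable.
with torch.no_grad():
    generated = model.generate(
        prompt, do_sample=True, max_new_tokens=20,
        top_k=50, pad_token_id=tokenizer.eos_token_id,
    )

# Toy reward on the decoded text (in practice a preference/reward model,
# a classifier, BLEU, ... -- no gradient needed through this step).
text = tokenizer.decode(generated[0], skip_special_tokens=True)
reward = 1.0 if "great" in text else -0.1

# Policy gradient: re-score the sampled sequence to get differentiable
# log-probs of the tokens that were actually sampled.
logits = model(generated).logits[:, :-1, :]          # position i predicts token i+1
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)

# Only the generated continuation contributes; scale by the scalar reward.
loss = -reward * token_log_probs[:, prompt_len - 1:].sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```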

8

alpha-meta OP t1_j72dpto wrote

Good point, so you mean they incorporate things like beam search, temperature scaling, top-k sampling, and nucleus sampling into the RL PPO-based optimization?

1

_Arsenie_Boca_ t1_j72g4g4 wrote

I'm not sure if they vary the sampling hyperparameters. The point is that language modelling objectives are to some degree ill-posed, because we calculate the loss on intermediate results (the per-token predictions under teacher forcing) rather than on the final output that we actually care about.
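For contrast with the RL sketch above, here is what the standard token-level objective looks like (again just an illustrative GPT-2 snippet, not the paper's setup): each position is penalised against the next ground-truth token, so the loss never sees a complete generated sentence.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer("The movie was great fun to watch", return_tensors="pt")

# Passing labels=input_ids makes the model compute the shifted per-token
# cross-entropy internally: the loss is over intermediate next-token
# predictions, not over any sampled final output.
out = model(**batch, labels=batch.input_ids)
print(out.loss)  # average token-level cross-entropy
```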

1