PassingTumbleweed t1_ja6w9ai wrote on February 27, 2023 at 7:46 AM

It's weird to read this when RLHF has been one of the key components of ChatGPT and friends.

cthorrez t1_ja70abd wrote on February 27, 2023 at 8:43 AM

I find it a little weird that RLHF is considered to be reinforcement learning.

The human feedback is collected offline and forms a static dataset. They use the PPO objective, but it's really closer to a form of supervised learning. There isn't an agent interacting with an env: the "env" is just sampling text from a static dataset, and the reward is the score from a neural net that was itself trained on a static dataset.
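To make that concrete, here's a rough toy sketch of what one data-collection pass looks like under that description. Everything in it is made up for illustration: `PROMPTS`, `policy_generate`, `reward_model`, and `collect_batch` are hypothetical stand-ins, not anyone's actual RLHF code, and the "reward model" is just a word count.

```python
import random

# Hypothetical static prompt dataset: the only "environment" there is.
PROMPTS = ["Explain PPO.", "What is RLHF?", "Define a policy."]


def policy_generate(prompt, rng):
    """Stand-in for LLM decoding: picks one of two canned replies at random."""
    replies = [
        "A short answer.",
        "A longer, more detailed answer to the question.",
    ]
    return rng.choice(replies)


def reward_model(prompt, response):
    """Stand-in for a frozen reward net trained on human preference data.

    Toy assumption: it simply scores a response by its word count.
    """
    return float(len(response.split()))


def collect_batch(n, seed=0):
    """Build n (prompt, response, reward) triples without any real env."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        prompt = rng.choice(PROMPTS)             # "env step" = sample static data
        response = policy_generate(prompt, rng)  # "action" = the whole generation
        reward = reward_model(prompt, response)  # reward = learned scorer's output
        batch.append((prompt, response, reward))
    return batch


batch = collect_batch(4)
```

Note there's no next state and no transition dynamics anywhere: each sample is a one-shot (prompt, response, score) triple, which is what would then get fed into the PPO-style loss.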

gniorg t1_ja7sjkn wrote on February 27, 2023 at 2:10 PM

So basically, batch reinforcement learning / offline RL? The family of algorithms is useful for recommender systems, amongst others.

cthorrez t1_ja8d6oc wrote on February 27, 2023 at 4:35 PM

Not exactly. In batch RL the data they train on are real (state, action, next state, reward) tuples from real agents interacting with real environments.

Batch RL then improves the policy offline from that logged data. In RLHF there actually is no env, and the policy is just standard LLM decoding.
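For contrast, here's a minimal sketch of the batch RL setting, assuming a tiny 3-state chain and plain tabular Q-learning replayed over a fixed log. The transitions in `LOGGED`, the state/action names, and the hyperparameters are all invented for the example, and tabular Q-learning is just one simple choice of offline method, not something specific from this thread.

```python
from collections import defaultdict

# Hypothetical log of (state, action, next_state, reward) tuples, as if
# recorded from an agent acting in a real environment (a 3-state chain).
LOGGED = [
    (0, "right", 1, 0.0),
    (1, "right", 2, 1.0),  # reaching state 2 pays off
    (1, "left", 0, 0.0),
    (0, "left", 0, 0.0),
]
ACTIONS = ["left", "right"]
TERMINAL = {2}


def offline_q_learning(transitions, gamma=0.9, lr=0.5, sweeps=200):
    """Replay the fixed log repeatedly; the environment is never queried."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, s_next, r in transitions:
            if s_next in TERMINAL:
                future = 0.0
            else:
                future = max(q[(s_next, b)] for b in ACTIONS)
            # Standard Q-learning update toward the bootstrapped target.
            q[(s, a)] += lr * (r + gamma * future - q[(s, a)])
    return q


def greedy_policy(q, states):
    """Pick the highest-valued action in each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}


q = offline_q_learning(LOGGED)
policy = greedy_policy(q, [0, 1])
```

The difference from the RLHF case is the `next_state` column and the bootstrapped `future` term: the logged data encodes real environment dynamics, and the update propagates reward back through them.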