
cthorrez t1_ja8d6oc wrote

Not exactly. In batch RL, the data they train on are real (state, action, next state, reward) tuples collected from real agents interacting with real environments.

They improve the policy offline from that fixed dataset. In RLHF there actually is no environment, and the policy is just standard LLM decoding.
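The batch setting can be sketched as Q-learning over a fixed set of logged tuples; a minimal toy example (all states, actions, and rewards are made up for illustration, not from any real agent):

```python
# Hypothetical logged transitions: (state, action, next_state, reward)
# tuples, as collected from a real agent in a real environment.
batch = [
    (0, 0, 1, 0.0),
    (1, 1, 2, 1.0),
    (2, 0, 0, 0.0),
    (1, 0, 1, 0.5),
]

n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

# Offline (batch) Q-learning: repeatedly sweep the fixed dataset.
# Note the environment is never queried during training -- the policy
# is improved purely from the logged tuples.
for _ in range(500):
    for s, a, s2, r in batch:
        target = r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
```

Contrast with RLHF, where there are no such transition tuples to replay: the "rollout" is just the LLM decoding tokens, and the reward comes from a learned preference model rather than an environment.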
