Submitted by [deleted] t3_11d4ka5 in MachineLearning
[deleted]
I find it a little weird that RLHF is considered to be reinforcement learning.
The human feedback is collected offline and forms a static dataset. They use the PPO objective, but it's really more a form of supervised learning. There isn't an agent interacting with an env: the "env" is just sampling text from a static dataset, and the reward is the score from a neural net that was itself trained on a static dataset.
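To make that concrete, here's a minimal sketch of the reward that PPO-style RLHF actually optimizes, following the InstructGPT recipe (names like `rm_score` and `kl_coef` are illustrative, not anyone's actual API):

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                logprob_policy: torch.Tensor,
                logprob_ref: torch.Tensor,
                kl_coef: float = 0.1) -> torch.Tensor:
    """Per-sample reward in PPO-style RLHF: the frozen reward model's
    score, minus a KL penalty that keeps the fine-tuned policy close
    to the original supervised model. Note that nothing in here
    queries a live environment."""
    kl_penalty = kl_coef * (logprob_policy - logprob_ref)
    return rm_score - kl_penalty
```

Both the prompts and the reward model come from static data; the "environment" never pushes back.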
So basically batch reinforcement learning / offline RL? That family of algorithms is useful for recommender systems, amongst other things.
Not exactly. In batch RL, the training data are real (state, action, next state, reward) tuples from real agents interacting with real environments, and the policy is then improved offline. In RLHF there is no env at all, and the "policy" is just standard LLM decoding.
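For contrast, here's a minimal sketch of what an offline RL dataset actually contains (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One logged interaction from a real agent in a real environment."""
    state: list        # observation before acting
    action: int        # action the behavior policy took
    reward: float      # reward the environment returned
    next_state: list   # observation after acting
    done: bool         # whether the episode ended here

# Offline RL improves a policy from a fixed set of such tuples,
# with no further environment interaction.
dataset: list[Transition] = []  # e.g. loaded from logged rollouts
```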
If all you do is follow trends and chase whatever is in the "spotlight", you probably don't care about your research, only about the accolades.
Do you specifically mean applications in NLP? RL has a lot of applications in fields like game playing, robotics, and neural theorem proving, which have no direct connection to LLMs.
Seems more like an AskML question.
But RL is for situations where you can't backpropagate through the loss. It's noisier than supervised learning, so if supervised learning is an option, that's generally what you should use.
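For example, with a non-differentiable reward you'd reach for a score-function (REINFORCE) estimator instead of backprop. A rough sketch, where `reward_fn` is a stand-in for any black-box scorer:

```python
import torch

def reinforce_loss(logits: torch.Tensor, reward_fn) -> torch.Tensor:
    """Score-function (REINFORCE) estimator: reward_fn can be a black
    box, since gradients flow only through the log-probabilities of
    the sampled actions, never through the reward itself."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                 # no gradient through sampling
    rewards = reward_fn(actions).detach()   # treat reward as a constant
    log_probs = dist.log_prob(actions)
    return -(log_probs * rewards).mean()    # minimizing maximizes reward
```

The price for skipping backprop through the reward is variance, which is exactly why supervised learning wins whenever it's applicable.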
RL is still used, for example in the recent GATO and Dreamer v3, or in training an LLM to use tools like in Toolformer. And also in OpenAI's famous RLHF, which stands for reinforcement learning from human feedback. This is what they use to make ChatGPT "aligned", although in reality it doesn't quite get there.
>toolformer
Are you sure there's RL in Toolformer? I thought it was mostly self-supervised and fine-tuned.
> Toolformer
....oh, you're right, it didn't. I assumed they let it use arbitrary tools, which would need RL, but it seems they had pre-labelled ways of using the tools.
Thanks for pointing that out.
FYI, GATO used imitation learning, which is closer to supervised learning than to RL.
RL + NLP and RL + vision could have some future, I guess; RL would be an integral part of such systems.
Imo it depends on what you mean by RL. If you interpret RL as the 2015-19 collection of algorithms that train deep NN agents tabula rasa (from zero knowledge), I'd be inclined to agree that it doesn't seem a particularly fruitful research direction to get into. But if you interpret RL as a general problem setting, where an agent must learn in a sequential decision-making environment, you'll see that it's not going away.
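Under that "problem setting" reading, the core loop looks the same regardless of how the policy is trained. A minimal sketch using the Gymnasium API, with a random policy as a placeholder:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # stand-in for any learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```

Whatever produces `action` there — a tabula rasa agent, a fine-tuned LLM, a diffusion model — the sequential decision-making problem doesn't go away.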
To me the most interesting recent research in RL (or whatever you want to name it) is figuring out how to leverage existing datasets or models to get agents working well in sequential environments. Think SayCan, ChatGPT, Diffusion BC...
Yann LeCun said on Twitter that it is dead... go figure.
Hottest ever. RLHF, robotics.
ChatGPT works using a combination of RL and an LLM.
You're correct that RL has been struggling. Not because of the impressive results by LLMs and image generators, but because progress within RL has been very slow. People who say otherwise have just forgotten what fast progress looks like; remember 2015-2018, when we first saw human-level Atari play, superhuman Go play, and then superhuman Atari play, as well as impressive results in StarCraft and Dota. I think if you'd asked someone back in 2018 what the next 5 years of RL would look like, they would have expected progressively more complicated games to fall, and for agents to graduate from playing with game-state information, as AlphaStar and OpenAI Five did, to besting humans on a level playing field by playing from the pixels on the screen, the way agents do in Atari. This hasn't happened.
Instead, it turned out that all of this progress was confined to narrow domains: specifically, games with highly limited input spaces (hence why OpenAI Five and AlphaStar had to take the game state directly, which gives them access to information that humans don't have) and games where exploration is easy (it can be handled in large part, or entirely, by making random moves some percentage of the time).
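That "random moves some percentage of the time" trick is epsilon-greedy exploration; roughly:

```python
import random

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    """With probability epsilon take a uniformly random action,
    otherwise take the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

It works fine in Atari-scale action spaces, but it's nowhere near enough for hard-exploration settings, which is part of why the harder games haven't fallen.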
I don't think this means the field is dead, mind you, but it certainly hasn't been making much progress lately.
Do what you like and what you believe in. Otherwise you're just doing things because of what other people think.
Computers are largely failed attempts at doing what our brains do. Our brains use RL (i.e., dopamine and serotonin as reward signals) and neural networks. It's probably worth studying for that reason alone :shrug:
PassingTumbleweed t1_ja6w9ai wrote
It's weird to read this when RLHF has been one of the key components of ChatGPT and friends.