idrajitsc t1_ix8lrk0 wrote

I mean, I'm not really sure what your ask is. People do work on RL for NLP. It just doesn't offer any huge advantage, and the reason your intuition doesn't translate into an actual advantage is that writing a reward function that reproduces the human feedback a baby receives is essentially impossible. And not just in an "it's hard, but if we put enough work into it we can figure it out" kind of way.
