Submitted by Ok-Variety-8135 t3_11c9zum in singularity
Imagine a language model that communicates in a special language whose tokens carry two parts of information:
- Sensation (like seeing, hearing, touching, etc.)
- Actuator state (like controlling the body's movements or speaking).
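A minimal sketch of what one of these two-part tokens might look like (the field names and the integer-ID encoding are my own assumptions, not anything specified in the post):

```python
from dataclasses import dataclass

@dataclass
class BodyToken:
    """One token of the hypothetical brain-body language.

    Both channels are assumed to be discretized into integer IDs so that a
    standard autoregressive model can predict them jointly.
    """
    sensation_id: int  # encoded sensory input (vision, audio, touch, ...)
    actuator_id: int   # encoded actuator command/state (joints, speech, ...)
```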
When the model is speaking, the predicted token guides a robot's behavior: the sensation part becomes the robot's imagination/thoughts, and the actuator state determines its movements. If the actuator state includes a microphone state, the robot is actually speaking.
When the model is listening, the next token comes from the robot's body: its environment sensors and actuator sensors.
The model decides on its own when to speak and when to listen.
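One way to picture this speak/listen loop, as a rough sketch: `model`, `robot`, and the mode-selection method are all hypothetical interfaces, not an actual implementation.

```python
def run_robot(model, robot, history):
    """Sketch of the proposed speak/listen loop (all interfaces hypothetical)."""
    while True:
        mode = model.choose_mode(history)        # the model itself decides when to speak or listen
        if mode == "speak":
            token = model.predict_next(history)  # predicted BodyToken
            robot.apply(token.actuator_id)       # actuator state drives the body (incl. microphone state)
            # token.sensation_id stays internal: the robot's "imagination"/thought
        else:  # "listen"
            token = robot.read_sensors()         # BodyToken built from environment + actuator sensors
        history.append(token)                    # spoken and heard tokens both join the conversation history
```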
All tokens, whether spoken or heard, form a "conversation history". That history is evaluated by a reward model that defines the robot's purpose.
The model continuously updates its weights with reinforcement learning on the evaluated conversation history.
Old conversation history can be deleted once it has been encoded into the model's weights.
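A rough sketch of that update step, using a plain REINFORCE-style objective as a stand-in for whatever RL algorithm would actually be chosen; `reward_model.score` and `model.log_prob` are assumed interfaces over PyTorch-style tensors and optimizers.

```python
def rl_update(model, reward_model, optimizer, history):
    """One update on an evaluated conversation history (sketch, not the author's exact method)."""
    reward = reward_model.score(history)   # scalar: how well this history served the robot's purpose
    log_probs = model.log_prob(history)    # log-probability of each token the model generated
    loss = -(reward * log_probs.sum())     # REINFORCE-style: reinforce rewarded behavior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    history.clear()                        # old history can be discarded once "encoded" into the weights
```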
In short, a “ChatGPT” that models the language between brain and body.
turnip_burrito t1_ja2f02y wrote
Sure, you can do it if you have enough data and a powerful enough computer.
Idk how you're going to do reinforcement learning to update the transformer weights though (I assume you want to use a transformer?). That's a lot of computation. The bigger your model is, the slower this update step will be.
Are you separating hearing and speaking/moving in time? Like are they separate steps that can't happen at the same time? My question then is why not make them simultaneous?