simmol t1_je77ct0 wrote
Training via broad sensory inputs will probably arrive with multimodal LLMs. Essentially, next-generation LLMs will be able to look at an image and either answer questions about that particular image (GPT-4 probably has this capability) or treat the image itself as the input and say something about it unprompted (GPT-4 probably does not). I think the latter ability will make an LLM seem more AGI-like, given that current LLMs only respond to user queries. But if a model can respond to an image unprompted, and you put it inside a robot, then presumably the robot can respond naturally to the ever-changing imagery coming from its sensors and talk about it accordingly.
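For the robot idea, here is a minimal sketch of that loop. Every function here is a hypothetical stub, not a real robot or model API; the point is just the shape of "look at the latest frame, comment on it, repeat":

```python
# Sketch of a robot narrating its own camera feed with a multimodal LLM.
# All functions are stubs; a real system would swap in an actual camera
# read and an actual vision-language model call.
import time

def capture_frame() -> bytes:
    """Stub: read one frame from the robot's camera (hypothetical)."""
    return b""

def describe_image(image_bytes: bytes) -> str:
    """Stub: send the image to a multimodal LLM with an open-ended
    instruction like "describe what you see" and return its commentary."""
    return "placeholder description"

def narration_loop(interval_s: float = 1.0, max_steps: int = 5) -> None:
    """Continuously look at the newest frame and talk about it, unprompted."""
    for _ in range(max_steps):
        frame = capture_frame()
        print(describe_image(frame))
        time.sleep(interval_s)

if __name__ == "__main__":
    narration_loop()
```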
I think once this happens, the LLM will seem less like a tool and more like a being. It probably does not solve the symbolic-logic part, building up knowledge from a simple set of rules, but that is likely a separate task on its own that will be solved not by multimodality but by layering the current LLM with another deep learning model (or via APIs/plugins with third-party apps).
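A rough sketch of that "layering" idea, where the LLM defers certain queries to a separate symbolic or rule-based component. Again, every name here is a made-up stub, not any specific plugin API:

```python
# Sketch of layering an LLM with a symbolic component: the LLM drafts a
# response, and anything it flags as a tool request is routed to a
# deterministic solver instead of being answered by the model itself.

def symbolic_solver(query: str) -> str:
    """Stub for a rule-based tool the LLM can defer to (toy arithmetic only)."""
    if query.startswith("eval:"):
        return str(eval(query.removeprefix("eval:"), {"__builtins__": {}}))
    return "unknown"

def llm_generate(prompt: str) -> str:
    """Stub for the base LLM; real code would call a model here."""
    return "TOOL eval:2+2" if "2+2" in prompt else "plain answer"

def answer(prompt: str) -> str:
    """LLM first; if it asks for a tool, run the symbolic layer and return that."""
    draft = llm_generate(prompt)
    if draft.startswith("TOOL "):
        return symbolic_solver(draft.removeprefix("TOOL "))
    return draft

print(answer("What is 2+2?"))  # -> "4"
```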
Green-Future_ OP t1_je78a0k wrote
Very insightful response, thank you for sharing.