
visarga t1_ja36ih0 wrote

We could have a model pre-trained on a large video dataset and then fine-tuned for various tasks, like GPT-3.

Using YouTube as training data, we'd get video plus audio, which decompose into parallel streams: images, movement, body pose, intonation, text transcript, and metadata. Such a dataset could dwarf the text datasets we have now, and it would capture lots of information that never makes it into text, such as the physical movements needed to accomplish a specific task.
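To make that concrete, here's a rough sketch of what one such parallel training record might look like. Every field name here is an illustrative assumption, not an existing dataset schema:

```python
from dataclasses import dataclass, field

# Hypothetical schema for one multi-modal training sample
# extracted from a single video clip. All field names are
# made up for illustration, not a real dataset format.
@dataclass
class VideoSample:
    frames: list          # image frames sampled from the clip
    audio: bytes          # raw waveform for the same time span
    poses: list           # per-frame body-pose keypoints
    motion: list          # optical-flow / movement features
    transcript: str       # ASR text aligned to the clip
    metadata: dict = field(default_factory=dict)  # title, tags, etc.
```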

I think the OP was almost right. A multi-modal model like this would be a good base for the next step, but it would still need instruction tuning and RLHF; pre-training alone is not enough.
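As a sketch of the recipe I mean, in order (every function here is a placeholder I made up, not a real training API):

```python
# Toy stubs showing only the ordering of the three stages.
def pretrain(model, video_corpus):    # self-supervised on raw video
    return model

def instruction_tune(model, demos):   # supervised task demonstrations
    return model

def rlhf(model, preferences):         # align with human feedback
    return model

base = pretrain(model=None, video_corpus=[])
tuned = instruction_tune(base, demos=[])
aligned = rlhf(tuned, preferences=[])
```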

One immediate application I see: automating desktop activities. After watching many hours of screencasts from YouTube, the model could learn to use apps and solve tasks at first sight, the way GPT-3.5 does for text, but without being limited to text.
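A rough sketch of the observe-act loop such a model would have to drive. Everything here (`capture_screen`, `execute`, the `predict_action` model API, the action format) is a hypothetical interface, not an existing library:

```python
import time

def capture_screen():
    """Hypothetical: grab the current desktop as an image."""
    raise NotImplementedError  # e.g. via a screenshot library

def execute(action):
    """Hypothetical: dispatch a click/keystroke/scroll action."""
    raise NotImplementedError  # e.g. via an OS automation API

def run_agent(model, goal: str, max_steps: int = 50):
    # Observe-act loop: the model sees the screen plus the goal
    # and emits one UI action at a time, like a human following
    # a screencast tutorial.
    for _ in range(max_steps):
        frame = capture_screen()
        action = model.predict_action(frame, goal)  # assumed model API
        if action["type"] == "done":
            break
        execute(action)
        time.sleep(0.2)  # give the UI time to update
```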
