arg_max t1_j0z1p30 wrote on December 20, 2022 at 2:00 PM

When? Probably now if someone decides to put enough money into it.
All the big text-to-image models like DALL-E, Imagen, and Stable Diffusion are not very novel in terms of methodology. They all rely heavily on existing ideas and then combine them with more compute, bigger datasets and some tweaks.

Videos are not much more than 3D images with certain temporal constraints. There are already small-scale diffusion models for videos. I'm not saying that it's trivial to get longer videos, since recurrent learning is often a bit tricky, but I don't see why it would be impossible. It will probably take a few years before consumer hardware can run video generation though; after all, we just about manage images at the moment.