Zermelane t1_itfr3j9 wrote

There are so, so many incremental steps between here and straight-out text-to-movie that will each be mind-blowing advances on their own.

  • Much more controllable text-to-image that actually stays on model consistently, not to mention consistently gives people the right number of limbs
  • Voice synthesis that can actually stay convincing and express different emotions through hours of generated audio
  • Audio synthesis to generate all of the sounds of a movie, in addition to the voices
  • Video synthesis that has all of those above properties, not to mention having far greater detail, resolution and accuracy than what we have now
  • Text generation that can maintain plot coherence and develop a plot through the length of an entire movie script
  • Either an amazing amount of engineering work to put together a system using separate models for all of the above (at least prompt-to-script and script-to-video; see the sketch after this list), or maybe even more astonishingly, a single system somehow doing it all end-to-end
  • All of the above as tools integrated into existing workflows
  • Systems that can critique and edit the text, image, audio and video outputs of other AIs, the way a workflow with an image generation system right now might involve a human doing cherry-picking and inpainting
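For what it's worth, here's a rough idea of what the "separate models glued together" option might look like. This is purely illustrative: every function below is a hypothetical placeholder standing in for whatever script, video, voice, and sound models you'd actually wire together, none of them a real API.

```python
# Hypothetical prompt-to-movie pipeline built from separate models.
# None of these interfaces exist as-is; they're stand-ins for the
# individual script, video, voice, and sound models described above.

def generate_script(premise: str) -> list[str]:
    """Text model: expand a one-line premise into an ordered list of scenes."""
    raise NotImplementedError  # e.g. a long-context LLM call

def render_scene(scene_description: str) -> bytes:
    """Video model: turn one scene description into a short video clip."""
    raise NotImplementedError  # e.g. a text-to-video diffusion model

def add_audio(clip: bytes, scene_description: str) -> bytes:
    """Voice + sound models: dialogue, foley, and music for the clip."""
    raise NotImplementedError

def prompt_to_movie(premise: str) -> list[bytes]:
    scenes = generate_script(premise)
    return [add_audio(render_scene(s), s) for s in scenes]
```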

I'm not saying we mightn't get all the way to text-to-movie fast. I am saying that even if it took several decades to happen, those would still be decades full of astonishing advances, most of which I couldn't even predict here.

51

ReadSeparate t1_itgs97l wrote

There is one big assumption in this, and that's that we won't get ALL of those things out of scale alone. It's entirely possible someone builds a multi-modal model trained on text, video, and audio, and a text-to-movie generator is simply a secondary feature of such a model.

If this does happen, we could see it as soon as 2-5 years from now, in my opinion.

The one major breakthrough I DO think we need to see before text-to-movie is something to replace Transformers, as they aren't really capable of long-term memory without hacks, and the hacks don't seem very good. You need long-term memory to have a coherent movie.
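As a concrete example of the kind of hack in use today: keep a rolling summary of everything generated so far and feed it back in, since the context window can't hold a whole movie. A minimal sketch, where `llm` is a stand-in for any completion API:

```python
def write_long_script(llm, outline: str, n_chunks: int = 20) -> str:
    """Generate a long script chunk by chunk, carrying a rolling summary
    forward as a crude substitute for real long-term memory."""
    summary, script = "Nothing has happened yet.", []
    for i in range(n_chunks):
        chunk = llm(
            f"Outline: {outline}\n"
            f"Story so far (summary): {summary}\n"
            f"Write the next scene ({i + 1} of {n_chunks}):"
        )
        script.append(chunk)
        # Compress old context so it still fits in the window next time.
        summary = llm(f"Summarize briefly:\n{summary}\n{chunk}")
    return "\n\n".join(script)
```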

I think it's pretty likely that everything else will be accomplished through scale and multi-modality.

16

red75prime t1_itk6c0n wrote

I'm sure that any practical AI system able to generate movies will not do it all by itself. It will use external tools so it doesn't waste memory and compute on mundane tasks like tracking the exact 3D positions of objects and remembering all the intricacies of their textures and surface properties.
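Think of an external scene store the model reads and writes through tool calls instead of holding everything in its context. A toy illustration of what that external state might look like (all names here are made up):

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    position: tuple[float, float, float]
    material: str  # texture / surface properties the model needn't memorize

@dataclass
class SceneState:
    """External store the generator queries instead of remembering details."""
    objects: dict[str, SceneObject] = field(default_factory=dict)

    def place(self, name: str, pos: tuple[float, float, float], material: str):
        self.objects[name] = SceneObject(pos, material)

    def lookup(self, name: str) -> SceneObject:
        return self.objects[name]

# The model would emit tool calls like these rather than tracking state itself:
scene = SceneState()
scene.place("hero_sword", (1.2, 0.0, 3.5), "brushed steel")
print(scene.lookup("hero_sword"))
```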

2

alisaxoxo t1_itggz22 wrote

Text-to-movie is probably far out. However, it doesn’t necessarily need to be designed how you’ve outlined it. Text-to-image is great but it probably won’t be used for creations that need this level of consistency.

Why limit yourself to chaotic text prompts when you could use an image, a model, an entire 3D rendering of the scene, or maybe even multiple iterations of all of these? Stable Diffusion’s img-to-img is already something of a proof of concept for this. With AI-generated 3D models on the way, I’d bet we’re getting closer to that. It could almost entirely fix the issues with limbs and consistency, since the model would have a 3D reference for how those things should look. This might not be outright possible at the moment, but I genuinely don’t believe it’ll be hard to implement in the long term, especially if we combine AI generation with some well-tested algorithmic approaches.
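For example, a plain render of the 3D scene could be the starting point for img-to-img, roughly like this with the diffusers library (the model name and parameter values are just illustrative, and argument names have shifted a bit between diffusers versions):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A plain render from the 3D scene: correct limbs, poses and layout,
# but no texture detail or lighting polish yet.
init_frame = Image.open("scene_render.png").convert("RGB")

frame = pipe(
    prompt="cinematic shot of the hero drawing a sword, dramatic lighting",
    image=init_frame,   # the 3D render keeps anatomy and layout consistent
    strength=0.5,       # how far the model may drift from the render
    guidance_scale=7.5,
).images[0]
frame.save("styled_frame.png")
```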

Video synthesis is still being developed but it’s important to highlight our standard of quality. Photorealistic AI generated live-action movies are still far out, but what about animated shows? Something at the level of The Last Airbender is already pretty damn close to being possible if you ask me. Other popular animation styles like anime probably wouldn’t be too far off from there. After that we might get Pixar type films and lastly I’d assume photorealistic.

Text generation that can maintain plot coherence is already demonstrable with GPT-3. It isn’t perfect, but it’s already decent.

But yes, ultimately it’ll require a lot of engineering and STILL wouldn’t be full-on text-to-movie. A human would still need to be involved for fine touches. That being said, the amount of work required will drop drastically, which is an important first step.

7

HumanSeeing t1_itg8ixy wrote

So basically we will have fully generated movies when we have AGI, like, might as well be that lol.

4

DEATH_STAR_EXTRACTOR t1_itowpzk wrote

I agree, this is what I'm saying. I wonder if I even started this trend; I only started saying or thinking this this year.

1

monsieurpooh t1_ithpe6v wrote

The thing is, as GPT itself has proven (it can be used for image generation despite being made for text), sometimes improving a model in a general way solves multiple problems at once.

2