Submitted by jayalammar t3_10dfex7 in MachineLearning

I'll be honest with you, it took me months to wrap my head around diffusion models. A couple of iterations of a blog post later and this is my best shot at a gentle intro to Stable Diffusion and how it works.

https://youtu.be/MXmacOUJUaw

The part that took the most reworking is forward diffusion and how to best describe it. Thanks to the many people acknowledged in the blog post who have helped me both understand it and explain it better.

Hope you find it helpful. Let me know if you have any questions or feedback.

189
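The forward diffusion the post mentions reworking can be sketched in a few lines. This is a toy numpy illustration of the standard noising process, not code from the video; the linear schedule values are assumptions, not Stable Diffusion's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, \bar{alpha}_t

def forward_diffuse(x0, t):
    # Closed-form sample from q(x_t | x_0):
    #   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    # so we can jump straight to any noise level t without iterating.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.normal(size=(8, 8))         # stand-in for an image or latent
x_early = forward_diffuse(x0, 10)    # still mostly signal
x_late = forward_diffuse(x0, 999)    # nearly pure Gaussian noise

print(alpha_bars[999] < 1e-3)  # True: almost no signal left at the last step
```

The closed form is why training is cheap: any timestep can be sampled directly rather than by running the chain step by step.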

Comments


wywywywy t1_j4los24 wrote

Very nice video. Clear & concise explanation

6

OJMofo t1_j4m4f0g wrote

Great video! Comprehensive overview that’s digestible for the target audience.

2

chaosmosis t1_j4mjxh9 wrote

Are the 77 token embedding vectors just concatenated together as ClipText's output? Is there any structure to their ordering as processed by the Image Information Creator? Assuming a trained model, would permuting the vectors' order before passing them forward to the next subcomponent break anything?

General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them that it was too complicated to work. Not sure what about my intuitions I should change in response to observing that this works despite them.

1
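On the permutation question above: in cross-attention, each image query takes a softmax-weighted sum over all key/value pairs, so permuting the 77 text embeddings (keys and values together) leaves the output unchanged. Any order information lives in CLIP's positional embeddings, already baked into the vectors themselves. A toy numpy sketch of that invariance (not the actual Stable Diffusion code, which adds learned projections and heads):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all key/value pairs.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical shapes: 4 image-patch queries, 77 text-token embeddings of dim 8.
q = rng.normal(size=(4, 8))
text = rng.normal(size=(77, 8))

out = cross_attention(q, text, text)

# Permute the 77 token embeddings; the output is identical, because a
# weighted sum over key/value pairs doesn't care about their order.
perm = rng.permutation(77)
out_permuted = cross_attention(q, text[perm], text[perm])

print(np.allclose(out, out_permuted))  # True
```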

juniperking t1_j4mma6c wrote

>General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them that it was too complicated to work. Not sure what about my intuitions I should change in response to observing that this works despite them.

probably the most important thing that makes model configurations like this work is that they're very large and generalizable. a lot of prior research focuses on finetuning for a specific task or dataset, but the fact that clip (for example) is able to learn generalized text + image embeddings across multiple domains helps downstream training work

3

NotDoingResearch2 t1_j4oaz7r wrote

Understanding what stable diffusion models are is easy. Understanding why they work and VAEs don’t is hard, especially when you consider they are just defective VAEs.

1

emreddit0r t1_j523nlc wrote

One thing I find glossed over/lacking in the diffusion model materials is the contribution of the UNet.

Coming from someone who is just trying to catch up on what's going on, the UNet seems to play a huge role (if I understand right, this is where the convolutional neural networks are discovering 2D features).

Relatively speaking, CNNs are kind of old news... but they're a big deal here. Unless I have something wrong? Do you know where I can learn more about how the UNet aspect works in depth?

1
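The UNet structure asked about above is essentially an encoder that downsamples, a bottleneck, and a decoder that upsamples, with skip connections linking matching resolutions. A toy numpy skeleton of just that shape (no learned weights, average pooling and nearest-neighbor upsampling standing in for the real convolution blocks; the resolutions are illustrative, not Stable Diffusion's):

```python
import numpy as np

def downsample(x):
    # Halve spatial resolution with 2x2 average pooling.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Double spatial resolution by nearest-neighbor repetition.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(x):
    # Encoder: keep each activation for a later skip connection.
    d1 = x                        # 64x64
    d2 = downsample(d1)           # 32x32
    d3 = downsample(d2)           # 16x16

    bottleneck = downsample(d3)   # 8x8

    # Decoder: upsample and merge with the matching encoder activation,
    # so fine detail lost in pooling is reintroduced at each scale.
    u3 = upsample(bottleneck) + d3   # skip connection at 16x16
    u2 = upsample(u3) + d2           # skip connection at 32x32
    u1 = upsample(u2) + d1           # skip connection at 64x64
    return u1

x = np.random.default_rng(0).normal(size=(64, 64))
out = toy_unet(x)
print(out.shape)  # (64, 64)
```

The skip connections are the key design choice: they let the decoder recover fine spatial detail directly from the encoder instead of forcing everything through the low-resolution bottleneck.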