Submitted by jayalammar t3_10dfex7 in MachineLearning

I'll be honest with you, it took me months to wrap my head around diffusion models. A couple of iterations of a blog post later and this is my best shot at a gentle intro to Stable Diffusion and how it works.

https://youtu.be/MXmacOUJUaw

The part that took the most reworking is forward diffusion and how to best describe it. Thanks to the many people acknowledged in the blog post who have helped me both understand it and explain it better.

Hope you find it helpful. Let me know if you have any questions or feedback.

189
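The forward diffusion the post mentions reworking can be sketched in a few lines. This is a toy numpy illustration of the standard noising process, not code from the video; the linear schedule values are assumptions, not Stable Diffusion's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, \bar{alpha}_t

def forward_diffuse(x0, t):
    # Closed-form sample from q(x_t | x_0):
    #   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    # so we can jump straight to any noise level t without iterating.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.normal(size=(8, 8))         # stand-in for an image or latent
x_early = forward_diffuse(x0, 10)    # still mostly signal
x_late = forward_diffuse(x0, 999)    # nearly pure Gaussian noise

print(alpha_bars[999] < 1e-3)  # True: almost no signal left at the last step
```

The closed form is why training is cheap: any timestep can be sampled directly rather than by running the chain step by step.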

Comments


wywywywy t1_j4los24 wrote

Very nice video. Clear & concise explanation

6

OJMofo t1_j4m4f0g wrote

Great video! Comprehensive overview that’s digestible for the target audience.

2

chaosmosis t1_j4mjxh9 wrote

Are the 77 token embedding vectors just concatenated together as ClipText's output? Is there any structure to their ordering as processed by the Image Information Creator? Assuming a trained model, would permuting the vectors' order before passing them forward to the next subcomponent break anything?

General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them that it was too complicated to work. Not sure what about my intuitions I should change in response to observing that this works despite them.

1
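On the permutation question above: in cross-attention, each image query takes a softmax-weighted sum over all key/value pairs, so permuting the 77 text embeddings (keys and values together) leaves the output unchanged. Any order information lives in CLIP's positional embeddings, already baked into the vectors themselves. A toy numpy sketch of that invariance (not the actual Stable Diffusion code, which adds learned projections and heads):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all key/value pairs.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical shapes: 4 image-patch queries, 77 text-token embeddings of dim 8.
q = rng.normal(size=(4, 8))
text = rng.normal(size=(77, 8))

out = cross_attention(q, text, text)

# Permute the 77 token embeddings; the output is identical, because a
# weighted sum over key/value pairs doesn't care about their order.
perm = rng.permutation(77)
out_permuted = cross_attention(q, text[perm], text[perm])

print(np.allclose(out, out_permuted))  # True
```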

juniperking t1_j4mma6c wrote

>General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them that it was too complicated to work. Not sure what about my intuitions I should change in response to observing that this works despite them.

probably the most important thing that makes model configurations like this work is that they're very large and generalizable. a lot of prior research focuses on finetuning for a specific task or dataset, but the fact that clip (for example) is able to learn generalized text + image embeddings across multiple domains helps downstream training work

3

NotDoingResearch2 t1_j4oaz7r wrote

Understanding what stable diffusion models are is easy. Understanding why they work and VAEs don’t is hard, especially when you consider they are just defective VAEs.

1

emreddit0r t1_j523nlc wrote

One thing I find glossed over/lacking in the diffusion model materials is the contribution of the UNet.

Coming from someone who is just trying to catch up on what's going on, the UNet seems to play a huge role (if I understand right, this is where the convolutional neural networks are discovering 2D features).

Relatively speaking, CNNs are kind of old news... but they're a big deal here. Unless I have something wrong? Do you know where I can learn more about how the UNet aspect works in depth?

1
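The UNet structure asked about above is essentially an encoder that downsamples, a bottleneck, and a decoder that upsamples, with skip connections linking matching resolutions. A toy numpy skeleton of just that shape (no learned weights, average pooling and nearest-neighbor upsampling standing in for the real convolution blocks; the resolutions are illustrative, not Stable Diffusion's):

```python
import numpy as np

def downsample(x):
    # Halve spatial resolution with 2x2 average pooling.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Double spatial resolution by nearest-neighbor repetition.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(x):
    # Encoder: keep each activation for a later skip connection.
    d1 = x                        # 64x64
    d2 = downsample(d1)           # 32x32
    d3 = downsample(d2)           # 16x16

    bottleneck = downsample(d3)   # 8x8

    # Decoder: upsample and merge with the matching encoder activation,
    # so fine detail lost in pooling is reintroduced at each scale.
    u3 = upsample(bottleneck) + d3   # skip connection at 16x16
    u2 = upsample(u3) + d2           # skip connection at 32x32
    u1 = upsample(u2) + d1           # skip connection at 64x64
    return u1

x = np.random.default_rng(0).normal(size=(64, 64))
out = toy_unet(x)
print(out.shape)  # (64, 64)
```

The skip connections are the key design choice: they let the decoder recover fine spatial detail directly from the encoder instead of forcing everything through the low-resolution bottleneck.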