Submitted by TheCockatoo t3_10m1sdm in MachineLearning
I guess what I'm trying to figure out is, what are the main reasons that DMs are outperforming GANs in text2img generation? Thanks!
And while being easier to train, they give better results.
Diffusion models are also much more versatile in their applications because of their iterative process.
You can do inpainting or img-to-img, for example, just by steering the noising/denoising process in different ways. You would have to retrain the whole GAN to achieve that.
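As a rough illustration of the img-to-img idea: none of the names below come from the thread; `eps_model` stands for an already-trained noise-prediction network and the schedule is an assumed linear one.

```python
import torch

# Toy DDPM-style setup (assumed linear beta schedule; for illustration only).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product, \bar{alpha}_t

@torch.no_grad()
def img2img(eps_model, x0, strength=0.6):
    """Start the reverse process from a noised version of an input image.

    eps_model(x_t, t) -> predicted noise (a hypothetical, already-trained network).
    strength in (0, 1]: how much of the original image gets destroyed and repainted.
    """
    t_start = int(strength * (T - 1))
    # Forward (noising) process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    eps = torch.randn_like(x0)
    x = alpha_bar[t_start].sqrt() * x0 + (1 - alpha_bar[t_start]).sqrt() * eps

    # Standard DDPM reverse steps from t_start down to 0; the model itself is unchanged.
    for t in range(t_start, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = eps_model(x, torch.tensor([t]))
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```

Inpainting works in the same spirit: at every step you overwrite the known region with an appropriately noised copy of the original image and let the model fill in only the masked part.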
>It's simpler (which leads to progress)
I wouldn't say current diffusion models are simpler; in fact, they are much more complex than even the most "complex" GAN architectures. However, it's exactly because of all the other points that they have become this complex. A vanilla GAN would never be able to endure this much tweaking without mode collapse. Compare that to even the most basic score-based models, which are always stable.
Sometimes, the "It just works™" proposition is much more appealing than pipeline simplicity or speed.
I'd guess that it's an easier optimization problem. GANs are known to have stability issues during training, likely due to the adversarial formulation.
I think a more interesting question is why it also performs better than VAEs, since diffusion models also fall under the category of variational inference. Again, I'd assume it's an easier optimization problem due to having a large number of denoising steps. Perhaps a technique like DRAW could match diffusion models if used with more steps? Not sure.
I always see diffusion/score models contrasted against VAEs, but is there really a good distinction? Especially given latent diffusion and IAFs and all the other blurry lines. I feel like any time you have a forward (encoding) process and a learned reverse process trained with an ELBO objective, it should count as a VAE.
I guess for it to really count as a variational autoencoder you need to be reconstructing the input
Isn't reconstructing the input exactly what the denoising objective does?
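For what it's worth, in the standard DDPM parameterization predicting the noise is just a reparameterized way of reconstructing the clean input (standard notation, not specific to this thread):

```latex
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
\;\;\Longrightarrow\;\;
\hat{x}_0(x_t, t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}},
\qquad
\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2
= \frac{\bar\alpha_t}{1-\bar\alpha_t}\,\lVert x_0 - \hat{x}_0(x_t, t)\rVert^2 .
```

So each training step really is reconstructing the input from a noisy view of it; the main difference from a plain VAE is that the "encoder" (the forward noising) is fixed rather than learned.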
Iterative refinement seems to be a big part of it. In a GAN, your network has to produce the image in a single forward pass. In diffusion models, the model actually sees the intermediate steps over and over and can make gradual improvements. Also, if you think about what the noise does: it first drowns out all the small details and only the low-frequency, large-scale structure survives. So in the first (high-noise) steps of generation, the model has to focus on the overall composition; then, as the noise level goes down, it can gradually start adding all the small details. On a more mathematical level, the noise smooths the data distribution and widens its support in the [0,1]^D cube (D = image dimensionality, e.g. 256x256x3). People typically assume the data lie on a low-dimensional manifold, which can make sampling from it hard without this smoothing.
Some support for this claim is that people were able to improve other generative models, like autoregressive models, using similar noisy distributions. Also, you can use GANs to sample from the intermediate (noisy) distributions, which works better than standard GANs.
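A tiny numpy illustration of the frequency argument above, under the common rule-of-thumb assumption of a 1/f^2 image power spectrum; the absolute numbers are arbitrary, only the trend matters.

```python
import numpy as np

# Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with white Gaussian eps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)           # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

freqs = np.arange(1, 129)                    # spatial frequencies (arbitrary units)
signal_power = 100.0 / freqs**2              # 1/f^2 image spectrum; overall scale is arbitrary
noise_power = 1.0                            # white noise: flat across frequencies

for t in [50, 250, 500, 900]:
    snr = alpha_bar[t] * signal_power / ((1.0 - alpha_bar[t]) * noise_power)
    visible = freqs[snr > 1.0]               # frequencies still above the noise floor
    fmax = visible.max() if visible.size else 0
    print(f"t={t:4d}: detail survives up to f ~ {fmax}")

# High frequencies (fine detail) sink below the noise floor first; at large t only the coarse,
# low-frequency structure is left, which is what the model must get right early in generation.
```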
It's hard to scale up, but check this new paper out: https://arxiv.org/pdf/2301.09515v1.pdf
It's quite hard to answer these questions for neural networks. We don't really know whether GANs are forever worse than Latent Diffusion Models; they are now, but previously they weren't, and perhaps in the future GANs will outperform LDMs again. It seems that the way we currently set up the denoising task is better suited to text2img than the way we currently set up GANs.
A model usually outperforms another when it's more efficient in how it stores information in its weights. Successive conditioned denoising layers seem to be more efficient for this task, but the approach also requires a good enough perceptual loss, a good enough encoder, etc. We know these networks could compete with GANs, but maybe the pieces just weren't good enough before, or weren't combined in a good enough way.
I feel like if you feed it a ton of videos, it should be complex enough to spit back out a decent high-quality image.
Anyone find a quicker process than the batch image stitching?
GANs produce an image "cut from the whole cloth" at once.
Diffusion models use a trick: between rounds of incremental noise removal, they perform a super-resolution round.
Technically speaking, you could start from a GAN output and then take it through rounds of super-resolution. The result would look a lot like what diffusion models produce. This leaves the question of how the new details would be guided, or, more technically, what the super-resolution features would be conditioned upon. If you are going to condition them on text embeddings, you might as well condition the whole process on the same embedding... and now you just have a diffusion model.
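That last observation is essentially how text conditioning works in practice: the same text embedding is fed to the denoiser at every reverse step. A sketch with placeholder names follows; the guidance trick shown is classifier-free guidance, which the comment above doesn't mention, and `eps_model` is an assumed trained network.

```python
import torch

@torch.no_grad()
def text_to_image(eps_model, text_emb, betas, alpha_bar, shape, guidance=7.5):
    """Every reverse step is conditioned on the same text embedding.

    eps_model(x_t, t, cond) -> predicted noise (a hypothetical trained network).
    """
    alphas = 1.0 - betas
    x = torch.randn(shape)                   # start from pure noise
    uncond = torch.zeros_like(text_emb)      # simplification: stand-in for an empty-prompt embedding
    for t in range(len(betas) - 1, -1, -1):
        tt = torch.tensor([t])
        # Classifier-free guidance: push the prediction towards the text-conditional one.
        eps_c = eps_model(x, tt, text_emb)
        eps_u = eps_model(x, tt, uncond)
        eps_hat = eps_u + guidance * (eps_c - eps_u)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```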
A second weakness of GANs is the narrowness of their output variety. When made to produce outputs corresponding to the category "dog", they tend to produce nearly the same dog each time (a symptom of mode collapse).
Stable Diffusion is made up of a VAE image encoder/decoder, a CLIP text encoder, and a U-Net denoiser trained with a diffusion objective (with transformer-style cross-attention for the text conditioning).
GAN-based text2image models are built mainly on ResNet-style generators trained with a generator+discriminator (adversarial) process.
IMO, you're looking for differences between U-Net and ResNet, and there are a few.
For the big guys, the difference that really matters is scalability. They want a model that can scale up with the amount of training data, so they can just throw more powerful hardware at it to achieve more competitive results. A GAN can cost several thousand dollars to train and hits its performance ceiling too soon. A Latent Diffusion model can cost as much as you can afford, and its performance will gradually improve with more resources thrown at it.
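To make the component breakdown above concrete, here is roughly how the pieces fit together at sampling time; every argument name is a placeholder for a trained component, not a real library API.

```python
import torch

@torch.no_grad()
def latent_diffusion_txt2img(vae, clip_text, unet, scheduler, prompt_tokens, steps=50):
    """Rough wiring of a Stable-Diffusion-style pipeline (all arguments are placeholders).

    vae        - autoencoder with .decode(latents) back to pixel space
    clip_text  - text encoder: tokens -> sequence of embeddings
    unet       - denoiser: (latents, t, text_emb) -> predicted noise
    scheduler  - holds the noise schedule and the update rule for one reverse step
    """
    text_emb = clip_text(prompt_tokens)                 # 1. encode the prompt once
    latents = torch.randn(1, 4, 64, 64)                 # 2. start from noise in *latent* space
    for t in scheduler.timesteps(steps):                # 3. iterative denoising with the U-Net,
        eps = unet(latents, t, text_emb)                #    conditioned on the same text embedding
        latents = scheduler.step(eps, t, latents)       #    at every step
    image = vae.decode(latents)                         # 4. decode latents back to pixels
    return image
```

The U-Net only ever works in the small latent space, so the expensive iterative part never touches full-resolution pixels, which is a big part of why the approach scales.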
HateRedditCantQuitit t1_j60rtsa wrote
This isn't the whole answer, but GANs are super hard to train, while diffusion models are an instance of much better-understood methods (MLE, score matching, variational inference). That leads to a few things:
- It's more reliable to converge (which leads to enthusiasm)
- It's easier to debug (which leads to progress)
- It's better understood (which leads to progress)
- It's simpler (which leads to progress)
- It's more modular (which leads to progress)
Hypothetically, it could even be that the best simple GAN is better than the best simple diffusion model, but it's easier to iterate on diffusion models, which means we'd still be better able to find the good ways to do diffusion.
tl;dr when I worked on GANs, I felt like a monkey hitting a computer with a wrench to make it work, while when I work on diffusion models, I feel like a mathematician deriving Right Answers™.
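One concrete example of the modularity point from the list above: the same trained noise predictor can be dropped into a completely different sampler after the fact, e.g. a deterministic DDIM-style update (eta = 0) instead of ancestral DDPM sampling. Names below are placeholders.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alpha_bar, shape, num_steps=50):
    """Deterministic DDIM-style sampling. eps_model is the same network a DDPM sampler
    would use; no retraining, just a different update rule."""
    T = len(alpha_bar)
    ts = torch.linspace(T - 1, 0, num_steps).long()     # a shortened timestep schedule
    x = torch.randn(shape)
    for i in range(len(ts) - 1):
        t, t_prev = ts[i], ts[i + 1]
        eps = eps_model(x, t.unsqueeze(0))
        # Reconstruct x0 from the noise prediction, then jump to the previous noise level.
        x0_hat = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t_prev].sqrt() * x0_hat + (1 - alpha_bar[t_prev]).sqrt() * eps
    return x
```

Nothing about training has to change for this kind of post-hoc swap, which is much harder to pull off with an adversarially trained generator.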