Submitted by TheCockatoo t3_10m1sdm in MachineLearning
ThatInternetGuy t1_j6300ue wrote
Stable Diffusion is made up of a VAE (image encoder/decoder), a CLIP text encoder, and a U-Net (with transformer-style attention blocks) that is trained with a diffusion (denoising) process.
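To make those pieces concrete, here's a minimal sketch using the Hugging Face diffusers library (the checkpoint name is just an example; any Stable Diffusion checkpoint exposes the same components):

```python
# Minimal sketch: load a Stable Diffusion pipeline and inspect its parts.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.vae).__name__)           # AutoencoderKL        -> the VAE
print(type(pipe.text_encoder).__name__)  # CLIPTextModel        -> the CLIP text encoder
print(type(pipe.unet).__name__)          # UNet2DConditionModel -> the U-Net denoiser
```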
GAN-based text2image is built mainly from ResNet-style networks trained with a generator+discriminator (adversarial) process.
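For contrast, here's a bare-bones sketch of that generator+discriminator loop in PyTorch. The tiny MLPs and random "real" batch are toy stand-ins (a real model would use ResNet-style CNNs on actual images); only the training structure matters here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the generator G and discriminator D.
latent_dim = 64
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 784)  # placeholder batch of "real" images

for step in range(100):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(32, latent_dim)).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(32, 1)) + \
             F.binary_cross_entropy_with_logits(D(fake), torch.zeros(32, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make D score fresh fakes as real.
    fake = G(torch.randn(32, latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```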
IMO, what you're really asking about is the difference between U-Net and ResNet. A few key differences:
- Training a ResNet adversarially is much more unstable and unpredictable (GAN training is prone to mode collapse and diverging generator/discriminator losses).
- With ResNet, you have to code a good custom discriminator (the component that scores the output images) for your specific model. With U-Net, the diffusion process takes care of this by itself: the training signal is just a regression loss against the added noise (see the sketch after this list).
- ResNet-based GAN output is typically limited to low resolutions like 128x128 (though it can probably be scaled up with extra tricks).
- Scaling a ResNet up doesn't necessarily make it more capable; its performance doesn't keep improving with the amount of training data. A U-Net can be scaled as large as VRAM allows and will keep taking advantage of more training data.
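To show why no discriminator is needed (the second bullet above), here's a stripped-down sketch of the diffusion training objective in PyTorch. The placeholder linear model and the random noise levels are illustrative only (the real thing is a text- and timestep-conditioned U-Net); the point is that the loss is plain regression onto the added noise, with no second network scoring the output:

```python
import torch
import torch.nn.functional as F

# Placeholder denoiser standing in for the U-Net.
model = torch.nn.Sequential(torch.nn.Linear(784, 784))

x0 = torch.randn(32, 784)      # batch of clean (latent) images
noise = torch.randn_like(x0)
alpha_bar = torch.rand(32, 1)  # per-example cumulative noise level
x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise  # noised input

# Training signal is just MSE against the added noise -- no discriminator.
loss = F.mse_loss(model(x_t), noise)
loss.backward()
```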
For the big players, that last bullet point is really what matters. They want a model whose performance scales with the amount of training data, so they can simply throw more powerful hardware at it to get more competitive results. A GAN can cost several thousand dollars to train and will hit its performance ceiling too soon. A latent diffusion model can cost as much as you can afford, and its performance will keep improving with the resources thrown at it.