Submitted by TheCockatoo t3_10m1sdm in MachineLearning
ThatInternetGuy t1_j6300ue wrote
Stable Diffusion is made up of a VAE (image encoder/decoder), a CLIP text encoder, and a U-Net (with transformer-style attention blocks) that is trained with a diffusion (denoising) process.
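To make those pieces concrete, here's a minimal sketch using the Hugging Face diffusers library (the checkpoint name is just an example; any Stable Diffusion checkpoint exposes the same components):

```python
# Minimal sketch: load a Stable Diffusion pipeline and inspect its parts.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.vae).__name__)           # AutoencoderKL        -> the VAE
print(type(pipe.text_encoder).__name__)  # CLIPTextModel        -> the CLIP text encoder
print(type(pipe.unet).__name__)          # UNet2DConditionModel -> the U-Net denoiser
```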
GAN-based text2image is built mainly from ResNet-style networks trained with a generator+discriminator (adversarial) process.
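For contrast, here's a bare-bones sketch of that generator+discriminator loop in PyTorch. The tiny MLPs and random "real" batch are toy stand-ins (a real model would use ResNet-style CNNs on actual images); only the training structure matters here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the generator G and discriminator D.
latent_dim = 64
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 784)  # placeholder batch of "real" images

for step in range(100):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(32, latent_dim)).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(32, 1)) + \
             F.binary_cross_entropy_with_logits(D(fake), torch.zeros(32, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make D score fresh fakes as real.
    fake = G(torch.randn(32, latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```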
IMO, what you're really asking about is the difference between U-Net and ResNet. A few key differences:
- Training a ResNet adversarially is much more unstable and unpredictable (GAN training is prone to mode collapse and diverging generator/discriminator losses).
- With ResNet, you have to code a good custom discriminator (the component that scores the output images) for your specific model. With U-Net, the diffusion process takes care of this by itself: the training signal is just a regression loss against the added noise (see the sketch after this list).
- ResNet-based GAN output is typically limited to low resolutions like 128x128 (though it can probably be scaled up with extra tricks).
- Scaling a ResNet up doesn't necessarily make it more capable; its performance doesn't keep improving with the amount of training data. A U-Net can be scaled as large as VRAM allows and will keep taking advantage of more training data.
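To show why no discriminator is needed (the second bullet above), here's a stripped-down sketch of the diffusion training objective in PyTorch. The placeholder linear model and the random noise levels are illustrative only (the real thing is a text- and timestep-conditioned U-Net); the point is that the loss is plain regression onto the added noise, with no second network scoring the output:

```python
import torch
import torch.nn.functional as F

# Placeholder denoiser standing in for the U-Net.
model = torch.nn.Sequential(torch.nn.Linear(784, 784))

x0 = torch.randn(32, 784)      # batch of clean (latent) images
noise = torch.randn_like(x0)
alpha_bar = torch.rand(32, 1)  # per-example cumulative noise level
x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise  # noised input

# Training signal is just MSE against the added noise -- no discriminator.
loss = F.mse_loss(model(x_t), noise)
loss.backward()
```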
For the big players, that last bullet point is really what matters. They want a model whose performance scales with the amount of training data, so they can simply throw more powerful hardware at it to get more competitive results. A GAN can cost several thousand dollars to train and will hit its performance ceiling too soon. A latent diffusion model can cost as much as you can afford, and its performance will keep improving with the resources thrown at it.