>Larger/better language models have a significant effect on the quality of image generation models. Source: the Google Imagen paper by Saharia et al., Figure A.5.
New Stable Diffusion models have to be trained to use the OpenCLIP text encoder. That's because many components in the attention/ResNet layers were trained to work with the representations learned by CLIP, so swapping in OpenCLIP without retraining would break those learned mappings.

In that retraining, however, OpenCLIP can be kept frozen, just as CLIP was frozen when Stable Diffusion / LDM was trained.
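To make the freezing concrete, here is a minimal sketch (not the actual Stable Diffusion training code) of how a text encoder is kept frozen so that only the diffusion model learns to consume its outputs. It assumes the Hugging Face `transformers` CLIP classes; the checkpoint name is the SD v1 encoder, and an SD v2-style model would load an OpenCLIP checkpoint instead.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# SD v1 used OpenAI's CLIP ViT-L/14; SD v2 swaps in an OpenCLIP encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze the encoder: its weights receive no gradient updates, so during
# training only the UNet's attention/ResNet layers adapt to its outputs.
text_encoder.requires_grad_(False)
text_encoder.eval()

tokens = tokenizer(
    ["a photograph of an astronaut riding a horse"],
    padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # (batch, 77, 768) hidden states fed to the UNet via cross-attention
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
```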
jayalammar (OP) wrote, replying to a comment by mrflatbush:
Oh, okay, I understand you now. These are actual examples from the dataset: they are the captions paired with these images in the LAION Aesthetics dataset. https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6.5plus
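If you'd like to browse those caption/image pairs yourself, here is a small sketch using the `datasets` library. The split name and the `TEXT`/`URL` column names are assumptions based on the dataset card, so verify them against the actual schema.

```python
from datasets import load_dataset

# Stream the LAION Aesthetics 6.5+ metadata instead of downloading it all.
ds = load_dataset(
    "ChristophSchuhmann/improved_aesthetics_6.5plus",
    split="train", streaming=True,  # assumed split name
)

# Print a few caption -> image-URL pairs (column names assumed).
for row in ds.take(3):
    print(row["TEXT"], "->", row["URL"])
```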