Submitted by dahdarknite t3_10r5gku in MachineLearning
LetterRip t1_j6v57y5 wrote
Mostly the language model. Imagen uses T5-XXL (its ~4.6-billion-parameter encoder), DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger variants used for ChatGPT), and SD just uses CLIP's text encoder with nothing else. The more sophisticated the language model, the better the image generator can understand what you want; CLIP is close to a bag-of-words model.
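If you want to see the size gap yourself, here's a minimal sketch using the Hugging Face transformers library. The checkpoint names are the public releases (not necessarily the exact weights used inside Imagen or DALL-E 2), and the T5-XXL download is large:

    # Compare the text encoders mentioned above.
    # Checkpoints are the publicly released ones, used here only for illustration.
    from transformers import CLIPTextModel, T5EncoderModel

    def param_count_billions(model):
        """Total parameter count, in billions."""
        return sum(p.numel() for p in model.parameters()) / 1e9

    # CLIP text encoder as used by Stable Diffusion v1.x (~0.12B parameters)
    clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    # T5-XXL encoder of the kind Imagen conditions on (~4.6B parameters)
    t5_text = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    print(f"CLIP text encoder: {param_count_billions(clip_text):.2f}B params")
    print(f"T5-XXL encoder:    {param_count_billions(t5_text):.2f}B params")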