Submitted by dahdarknite t3_10r5gku in MachineLearning
LetterRip t1_j6v57y5 wrote
Mostly the language model. Imagen uses T5-XXL (its ~4.6-billion-parameter encoder), DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger variants used for ChatGPT), and SD just uses CLIP's text encoder with nothing else. The more sophisticated the language model, the better the image generator can understand what you want; CLIP is close to a bag-of-words model.
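If you want to see the size gap yourself, here's a minimal sketch using the Hugging Face transformers library. The checkpoint names are the public releases (not necessarily the exact weights used inside Imagen or DALL-E 2), and the T5-XXL download is large:

    # Compare the text encoders mentioned above.
    # Checkpoints are the publicly released ones, used here only for illustration.
    from transformers import CLIPTextModel, T5EncoderModel

    def param_count_billions(model):
        """Total parameter count, in billions."""
        return sum(p.numel() for p in model.parameters()) / 1e9

    # CLIP text encoder as used by Stable Diffusion v1.x (~0.12B parameters)
    clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    # T5-XXL encoder of the kind Imagen conditions on (~4.6B parameters)
    t5_text = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    print(f"CLIP text encoder: {param_count_billions(clip_text):.2f}B params")
    print(f"T5-XXL encoder:    {param_count_billions(t5_text):.2f}B params")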