martianunlimited t1_j8tm2qd wrote on February 16, 2023 at 9:27 PM

This is an ELI5 explanation of why we start from noise and denoise it conditioned on the text encoder's output: look at the clouds, and I tell you that I see an elephant in them. It is easier to imagine the elephant in the clouds than if I tell you to imagine an elephant on a blank piece of white paper.

(The less ELI5 explanation is that the entropy going from noise to an image is lower than going from a uniform image to an image.) If you want to see that for yourself, with a bit of programming knowledge you can write your own diffuser pipeline that skips the noise-adding stage and try img2img from a blank image. (It's literally just ~3 lines of edits.)
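To make the "skip the noise-adding stage" idea concrete, here is a rough toy sketch in plain Python. It is not the actual pipeline edit, just the standard DDPM-style forward step `x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps` applied to a flattened blank image. The function name `add_noise`, the `skip_noise` flag, and the value of `alpha_bar_t` are all made up for illustration.

```python
import math
import random
import statistics


def add_noise(x0, alpha_bar_t, rng, skip_noise=False):
    """Toy forward-diffusion step, as used at the start of img2img.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)
    With skip_noise=True the input is returned unchanged, which mimics
    the experiment of skipping the noise-adding stage.
    """
    if skip_noise:
        return list(x0)
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)      # fixed seed so the run is repeatable
blank = [1.0] * 4096        # flattened "white paper": every pixel identical
alpha_bar_t = 0.1           # hypothetical noise level (mostly noise)

noised = add_noise(blank, alpha_bar_t, rng)
skipped = add_noise(blank, alpha_bar_t, rng, skip_noise=True)

# Spread of pixel values the denoiser would start from in each case.
print("with noise:   ", statistics.pstdev(noised))
print("without noise:", statistics.pstdev(skipped))
```

With the noise step skipped, every pixel stays identical (zero spread), so the denoiser gets the "white paper" and nothing to carve an elephant out of. With the noise step, the spread is roughly `sqrt(1 - alpha_bar_t)`, which is the "clouds" case.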

(Side note: someone brought up a similar question, but in a different vein: removing the random seed.)