moschles t1_izhydos wrote on December 9, 2022 at 6:02 AM

> To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.

It seems like one such line of research would be automated photo captioning.

tetrisdaemon OP t1_izi47x8 wrote on December 9, 2022 at 7:12 AM

For sure, and also how linguistics can guide Stable Diffusion to produce better images. For example, if we already understand how objects should relate on the language side (e.g., "a giraffe and a zebra" should probably produce two distinct animals, unlike what was observed in the paper), we can twiddle the attention maps so that the giraffe and the zebra are separate.
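
As a rough illustration of what "twiddling" the maps could mean, here is a minimal sketch on toy data. It is not the paper's method and it is not wired into Stable Diffusion: the function names `separate_maps` and `overlap` and the 2x3 heat maps are all made up for the example. It assumes we already have one 2D attention map per word and applies a simple winner-take-all rule, so each pixel belongs to whichever word attends to it more strongly.

```python
def separate_maps(map_a, map_b):
    """Winner-take-all split of two same-shaped 2D attention maps.

    At each pixel, the token with the larger attention keeps its value
    and the other token's value is set to zero. Ties go to map_a.
    (Hypothetical helper, for illustration only.)
    """
    out_a, out_b = [], []
    for row_a, row_b in zip(map_a, map_b):
        new_a, new_b = [], []
        for a, b in zip(row_a, row_b):
            if a >= b:
                new_a.append(a)
                new_b.append(0.0)
            else:
                new_a.append(0.0)
                new_b.append(b)
        out_a.append(new_a)
        out_b.append(new_b)
    return out_a, out_b


def overlap(map_a, map_b):
    """Sum of per-pixel minimums; 0 means the maps never share a pixel."""
    return sum(
        min(a, b)
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    )


# Toy 2x3 maps standing in for the words "giraffe" and "zebra" (made-up values).
giraffe = [[0.9, 0.6, 0.1],
           [0.8, 0.5, 0.2]]
zebra = [[0.2, 0.7, 0.9],
         [0.1, 0.4, 0.8]]

sep_giraffe, sep_zebra = separate_maps(giraffe, zebra)

print("overlap before:", overlap(giraffe, zebra))
print("overlap after: ", overlap(sep_giraffe, sep_zebra))
```

In a real pipeline the edited maps would have to be fed back into the cross-attention layers during denoising, which this sketch does not attempt. A hard winner-take-all split is also only one choice; a softer reweighting might be gentler on image quality.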