Submitted by tetrisdaemon t3_zgg7y7 in MachineLearning


https://preview.redd.it/m2pg8yhahr4a1.png?width=2117&format=png&auto=webp&s=c6ef4cbef10f5d04045fb606e5123fb7a64f2ed5

Paper: What the DAAM: Interpreting Stable Diffusion Using Cross Attention (arXiv paper, codebase)

Abstract:

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.

Authors: Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture
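
For readers skimming the thread, here's a minimal sketch of the heat-map aggregation the abstract describes: upscale each layer's cross-attention scores for one word to image resolution and sum them up. The tensor shapes and names are illustrative assumptions, not the paper's actual code (see the linked codebase for that).

```python
import torch
import torch.nn.functional as F

def word_heat_map(attn_maps, token_idx, out_size=512):
    """Aggregate cross-attention scores for one prompt token into a
    pixel-level attribution map (illustrative sketch of the DAAM idea).

    attn_maps: list of tensors shaped [heads, h*w, num_tokens], one per
               cross-attention layer (and/or denoising timestep).
    token_idx: position of the word to attribute.
    """
    total = torch.zeros(out_size, out_size)
    for attn in attn_maps:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)                     # latent maps are square
        m = attn[:, :, token_idx].reshape(heads, 1, side, side)
        m = F.interpolate(m, size=(out_size, out_size),
                          mode="bicubic", align_corners=False)
        total += m.sum(dim=0).squeeze(0)          # sum heads, then layers
    return total / (total.max() + 1e-8)           # normalize for display
```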

85

Comments


Parzival_007 t1_izhqde5 wrote

Hi, I checked out your work before you posted, and daam, it's good. Well done!

9

moschles t1_izhydos wrote

> To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.

It seems like one line of research here would be automated photo captioning.

7

tetrisdaemon OP t1_izi47x8 wrote

For sure, and also how linguistics can guide Stable Diffusion to produce better images. For example, if we already understand how objects should relate on the language side (e.g., "a giraffe and a zebra" should probably produce two distinct animals, unlike what we observed in the paper), we can twiddle the attention maps so that the giraffe and the zebra come out separate (see the toy sketch below).
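
As a toy illustration of that kind of twiddling (entirely hypothetical, not something from the paper), one could bias two tokens' attention toward disjoint image regions:

```python
import torch

def separate_tokens(attn, idx_a, idx_b):
    """Push two prompt tokens toward disjoint image halves (toy sketch).

    attn: cross-attention scores shaped [heads, h*w, num_tokens].
    idx_a, idx_b: token positions, e.g. for "giraffe" and "zebra".
    """
    heads, hw, _ = attn.shape
    side = int(hw ** 0.5)
    cols = torch.arange(hw) % side            # column of each latent pixel
    attn = attn.clone()
    attn[:, cols >= side // 2, idx_a] = 0.0   # keep token A on the left
    attn[:, cols < side // 2, idx_b] = 0.0    # keep token B on the right
    return attn
```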

3

JClub t1_izij5x5 wrote

Hey! I'm the author of https://github.com/JoaoLages/diffusers-interpret

I have also tried to collect attention scores during the diffusion process, but the matrices of shape (text size, image size) were too big to keep in RAM/VRAM. How did you solve that problem?

2

tetrisdaemon OP t1_izjm0ov wrote

Cool, nicely done repository. Are you referring to the [16, 4096-ish, 77] cross-attention matrices? I maintained a streaming sum over matrices of that size (sketched below), on a machine with 64GB of RAM (though it does work with 32GB) and 24GB of VRAM.
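
For the curious, a rough sketch of what such a streaming sum can look like (shapes and names are illustrative):

```python
import torch

class RunningAttentionSum:
    """Keep one accumulator instead of storing every step's attention."""

    def __init__(self):
        self.total = None
        self.count = 0

    def update(self, attn):
        # attn: e.g. [16 heads, ~4096 latent pixels, 77 text tokens]
        attn = attn.detach().to("cpu", torch.float32)
        self.total = attn if self.total is None else self.total + attn
        self.count += 1

    def mean(self):
        return self.total / max(self.count, 1)
```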

3

JClub t1_izjnf35 wrote

Damn, then this method can only run on hardware like that; the attention weights are very heavy!

1

tetrisdaemon OP t1_izk7fk0 wrote

Yeah, moving forward it might help to have a disk caching mode (something like the sketch below).
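
A hypothetical disk-caching mode might just spill each step's tensor with torch.save and reload it lazily; everything below is an assumption, not a planned API:

```python
import os
import torch

class DiskAttentionCache:
    """Spill per-step attention tensors to disk instead of holding them in RAM."""

    def __init__(self, cache_dir="daam_cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.cache_dir = cache_dir
        self.num_steps = 0

    def save(self, attn):
        path = os.path.join(self.cache_dir, f"step_{self.num_steps}.pt")
        torch.save(attn.detach().cpu(), path)
        self.num_steps += 1

    def load(self, step):
        return torch.load(os.path.join(self.cache_dir, f"step_{step}.pt"))
```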

2

calciumcitrate t1_izigomm wrote

/u/tetrisdaemon Any idea what part of the diffusion process might be causing the failure modes (the latent representations, the CLIP embeddings, the cross-attention conditioning, etc.)?

My initial guess was that the CLIP embeddings aren't fine-grained enough to represent some relationships between entities in a sentence, but if I understand correctly, the cross-attention conditioning adds some additional text supervision (I'm assuming X in eqs. 4 and 5 is some transformer representation of the prompt), and it does seem like some dependency relationships are being captured.

1

tetrisdaemon OP t1_izjp9nc wrote

I'm looking into it, but I'm guessing it's the CLIP embeddings, so disentanglement might need to happen at that level. Some supporting evidence: even if we set the cross attention to zero for some words (see the sketch below), those words still show up in the final image, indicating that the word representations are already mixed together in CLIP.
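
The zeroing experiment amounts to something like this hedged sketch (where the hook plugs into the model is omitted):

```python
import torch

def ablate_tokens(attn, token_indices):
    """Zero the cross-attention received by selected prompt tokens.

    attn: cross-attention scores shaped [heads, h*w, num_tokens].
    token_indices: positions of the words to ablate.
    """
    attn = attn.clone()
    attn[:, :, token_indices] = 0.0
    return attn
```

If an ablated word still shows up in the generated image, its meaning has evidently leaked into the other tokens' CLIP embeddings, which is exactly the entanglement described above.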

2

Purplekeyboard t1_izih5hd wrote

>descriptive adjectives attend too broadly.

If this means that a word in a prompt affects the whole image and not just the phrase it's part of, everyone who uses Stable Diffusion knows this. If your prompt is "girl, chair, sitting, computer, library, earrings, necklace, blonde hair, hat", and you modify it to specify "red chair", you're likely to also get a red hat, or the girl will now be wearing a red shirt, or various other parts of the image may turn red.

If you change the prompt from library to outdoors, and add the word snow, it will likely be snowing, but also the earrings or a pendant on the necklace may now be in the shape of a snowflake.

This is just how Stable Diffusion works.

−1

tetrisdaemon OP t1_izjmb5s wrote

This is a good observation. Actually, in the paper we tried out "{rusty, wooden, metallic} shovel in a clean shed," and the model still made the shed rusty. Moving forward, we do plan to do the same thing with the other ball prompt.

2