
-xXpurplypunkXx- t1_j6ulhcj wrote

I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

Very interesting. I'm wondering how sensitive this methodology is at detecting instances of memorization, though; maybe this is just the tip of the iceberg.

6

LetterRip t1_j6ut9kc wrote

> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

It sees most images between 1 time (LAION 2B) and 10 times (the aesthetic dataset is trained for multiple epochs). It simply can't extract enough from an image to learn that much about it with so few exposures. If you've tried fine-tuning a model on a handful of images, you know it takes a huge number of exposures to memorize an image.

Also, the model capacity is small enough that on average it can learn about 2 bits of unique information per image.

10

-xXpurplypunkXx- t1_j6v3fab wrote

Thanks for the context. Maybe a little too much woo in my post.

For me, whatever decides which images end up completely stored is either an interesting artifact or an interesting property of the model.

But regardless, it is very unintuitive to me given how diffusion models train and behave, due both to the mutation of training images and to the foreseeable lack of space to encode that much information in a single model state. Admittedly, I don't have much working experience with these sorts of models.

1

pm_me_your_pay_slips OP t1_j6vgxpe wrote

>on average it can learn 2 bits of unique information per image.

The model capacity is not spent on learning specific images, but on learning the mapping from noise to latent vectors corresponding to natural images. Human-made or human-captured images have common features shared across images, and that's what matters for learning the mapping.

As an extreme example, imagine you ask 175 million humans to each draw a random digit between 0 and 9 on a piece of paper. You then collect all the drawings into a dataset of 256x256 images. Would you still argue that the SD model's capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?

1

LetterRip t1_j6vo0zz wrote

> The model capacity is not spent on learning specific images

I'm completely aware of this. It doesn't change the fact that the average information retained per image is about 2 bits (roughly 2 GB of parameters divided by the total number of images it was trained on).
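
A back-of-envelope version of that division, for anyone who wants to see it spelled out. The checkpoint size and exposure count below are rough assumptions of mine (not numbers from the thread or the paper), picked so the arithmetic lands near the ~2-bit figure:

```python
# Back-of-envelope estimate of per-image capacity.
# Assumed numbers (not from the paper): a ~2 GB fp16 checkpoint and
# roughly 8 billion total image exposures (LAION images, counting the
# repeated epochs over the aesthetic subset).
checkpoint_bytes = 2e9                  # ~2 GB of parameters
checkpoint_bits = checkpoint_bytes * 8  # ~1.6e10 bits

total_image_exposures = 8e9             # assumed: images seen, counting repeats

bits_per_image = checkpoint_bits / total_image_exposures
print(f"~{bits_per_image:.0f} bits of capacity per image seen")  # ~2
```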

> As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. you then collect all the images into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?

I didn't say it learned 2 bits of pixel data. It learned 2 bits of information. That information is in a higher-dimensional space, so it is much more informative than 2 bits of pixel-space data, but it is still an extremely small amount of information.

Given that it often takes about 1,000 repetitions of an image to approximately memorize its key attributes, we can infer it takes about 2^10 bits on average to memorize an image. So on average it learns about 1/1000 of the available image data each time it sees an image, or roughly 1/2 kB equivalent of compressed image data.
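
The same kind of rough arithmetic for the last figure; the compressed-image size here is an assumption of mine, not a number from the thread:

```python
# Rough arithmetic behind the "1/2 kB per exposure" estimate; the
# compressed-image size is an assumed number, not one from the thread.
repetitions_to_memorize = 1_000                                # ~1000 exposures to roughly memorize an image
fraction_learned_per_exposure = 1 / repetitions_to_memorize    # ~1/1000 of the image per exposure

compressed_image_kb = 500                                      # assumed size of a typical compressed training image
kb_equivalent_per_exposure = compressed_image_kb * fraction_learned_per_exposure
print(f"~{kb_equivalent_per_exposure:.1f} kB of compressed-image-equivalent data per exposure")  # ~0.5
```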

11