Submitted by pm_me_your_pay_slips t3_10r57pn in MachineLearning
RandomCandor t1_j6uaa0o wrote
Fascinating. I always thought this sort of thing was either very difficult or impossible.
koolaidman123 t1_j6ug73c wrote
it is, the memorization rate is like 0.03% or less
https://twitter.com/BlancheMinerva/status/1620781482209087488
IDoCodingStuffs t1_j6uk67h wrote
~~In this case the paper seems to use a very conservative threshold to avoid false positives -- l2 distance < 0.1, full image comparison. Which makes sense for their purposes, since they are trying to establish the concept rather than investigating its prevalence.
It is definitely a larger number than 0.03% when you pick a threshold to optimize the F score rather than just precision. How much larger? That's a bunch of follow-up studies.~~
starstruckmon t1_j6v1qv0 wrote
They also manually annotated the top 1000 results, adding only 13 more images. The number you're replying to counted those.
DigThatData t1_j6uxsdj wrote
> full image comparison.
that's not quite the metric they used, and for reasons related to what you suggest: they found that a plain full-image L2 comparison was unreliable -- it produced spurious near-matches for images with large black backgrounds. So they chunked up each image into regions and used the score for the most dissimilar (but corresponding) regions to represent the whole image.
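A minimal sketch of what that chunked comparison might look like (the tile count, normalization, and function name here are my own guesses for illustration, not the paper's exact values):

```python
import numpy as np

def tiled_l2_distance(img_a: np.ndarray, img_b: np.ndarray, tiles: int = 4) -> float:
    """Split two same-sized images into a tiles x tiles grid and return the
    L2 distance of the *most dissimilar* pair of corresponding regions.
    A candidate only counts as a near-duplicate if even its worst region is
    close, which avoids false positives from images that only agree on a
    large black background."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape[:2]
    th, tw = h // tiles, w // tiles
    worst = 0.0
    for i in range(tiles):
        for j in range(tiles):
            a = img_a[i*th:(i+1)*th, j*tw:(j+1)*tw].astype(np.float32) / 255.0
            b = img_b[i*th:(i+1)*th, j*tw:(j+1)*tw].astype(np.float32) / 255.0
            d = float(np.sqrt(np.mean((a - b) ** 2)))  # normalized per-tile distance
            worst = max(worst, d)
    return worst

# flag a likely memorization hit only if even the worst tile is below some threshold
```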
Further, I think they demonstrated their methodology probably wasn't too conservative when they were able to use the same approach to get a 2.3% (concretely: 23 memorized images in 1000 tested prompts) hit rate from Imagen. That hit rate is very likely a big overestimate of Imagen's propensity to memorize, but it demonstrates that the authors' metric is able to do its job.
Also, it's not like the authors didn't look at the images. They did, and found a handful more hits, which the 0.03% figure already accounts for.
-xXpurplypunkXx- t1_j6ulhcj wrote
I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.
Very interesting. I'm wondering how sensitive this methodology is to finding instances of memorization though; maybe this is the tip of the iceberg.
LetterRip t1_j6ut9kc wrote
> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.
It sees most images somewhere between 1 time (LAION-2B) and roughly 10 times (the aesthetic subset is trained for multiple epochs). It simply can't retain that much about an image from so few exposures. If you've tried fine-tuning a model on a handful of images, you know it takes a huge number of exposures to memorize an image.
Also the model capacity is small enough that on average it can learn 2 bits of unique information per image.
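Rough back-of-envelope behind that "2 bits" figure (the parameter count, precision, and exposure count below are my own ballpark assumptions, not measured values):

```python
# Ballpark only: if the weights' capacity were spread evenly across every
# image exposure, how many bits per image could the model possibly retain?
params = 1.0e9            # ~1B parameters for SD (assumption)
bits_per_param = 16       # fp16 weights (assumption)
image_exposures = 5.0e9   # ~2B LAION images plus several epochs on the aesthetic subset (assumption)

capacity_bits = params * bits_per_param
print(capacity_bits / image_exposures)  # ~3 bits per exposure -- same order as the ~2-bit claim above
```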
-xXpurplypunkXx- t1_j6v3fab wrote
Thanks for context. Maybe a little too much woo in my post.
For me, whatever determines which images end up completely stored, and at what fidelity, is either an interesting artifact or an interesting piece of the model.
But regardless, it's very unintuitive to me given how diffusion models train and behave, both because of how training images are perturbed during training and because of the foreseeable lack of space to encode that much information in a single model state. Admittedly I don't have much working experience with these sorts of models.
pm_me_your_pay_slips OP t1_j6vgxpe wrote
>on average it can learn 2 bits of unique information per image.
The model capacity is not spent on learning specific images, but on learning the mapping from noise to latent vectors corresponding to natural images. Human-made or human-captured images have common features shared across images, and that's what matters for learning the mapping.
As an extreme example, imagine you ask 175 million humans to draw a random digit between 0 and 9 on a piece of paper. You then collect all the drawings into a dataset of 256x256 images. Would you still argue that the SD model's capacity is not enough to fit that hypothetical digit dataset because it can only learn 2 bits per image?
LetterRip t1_j6vo0zz wrote
> The model capacity is not spent on learning specific images
I'm completely aware of this. It doesn't change the fact that the average information retained per image is about 2 bits (roughly 2 GB of parameters divided by the total number of images the model was trained on).
> As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. you then collect all the images into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?
I didn't say it learned 2 bits of pixel data. It learned 2 bits of information. The information lives in a higher-dimensional space, so it is much more informative than 2 bits of pixel-space data, but it is still an extremely small amount of information.
Given that it often takes about 1000 repetitions of an image to approximately memorize its key attributes, we can infer it takes about 2**10 bits on average to memorize an image. So on average it learns about 1/1000 of the available image data each time it sees an image, or roughly the equivalent of 1/2 kB of compressed image data.
DigThatData t1_j6ugpgr wrote
very difficult is correct. The authors identified 350,000 candidate prompt/image pairs that were likely to have been memorized because they were duplicated repeatedly in the training data, and were only able to find 109 cases of memorization in Stable Diffusion within that 350k (about 0.03%).
EDIT:
Conflict of Interest Disclosure: I'm a Stability.AI employee, and as such I have a financial interest in protecting the reputation of generative models generally and SD in particular. Read the paper for yourself. Everything here is my own personal opinion, and I am not speaking as a representative of Stability AI.
My reading is that yes: they demonstrated these models are clearly capable of memorizing images, but also that they are clearly capable of being trained in a way that makes them fairly robust to this phenomenon. Imagen has a higher capacity and was trained on much less data: it unsurprisingly is more prone to memorization. SD was trained on a massive dataset and has a smaller capacity: after constraining attention to the content we think it had the best excuse to have memorized, it barely memorized any of it.
There's almost certainly a scaling law here, and finding it will permit us to be even more principled about robustness to memorization. My personal reading of this experiment is that SD is probably pretty close to the Pareto boundary here, and we could probably flush out the memorization phenomenon entirely if we trained it on more data, trimmed away at the capacity, or tinkered with the model's topology.
Nhabls t1_j6uokwb wrote
It's incredibly easy to make giant LLMs regurgitate training data near verbatim. There's very little reason to believe that this won't just start happening more frequently with image models as they grow in scale as well.
Personally I just hope it brings a reality check in the courts to these companies that think they can just monetize generative models trained on copyrighted material without permission
ItsJustMeJerk t1_j6uqkv6 wrote
Actually, the data has shown that past a certain size, larger models end up generalizing more than smaller ones. It's called double descent.
Nhabls t1_j6urk1b wrote
This isn't really relevant. Newer, larger LLMs generalize better than smaller ones, yet they also regurgitate training data more readily. The two aren't mutually exclusive.
ItsJustMeJerk t1_j6uymag wrote
You're right, it's not exclusive. But I believe that while the absolute amount of data memorized might go up with scale, it occupies a smaller fraction of the output, because it's only used where verbatim recitation is necessary rather than as a crutch (I could be wrong though). Anyway, I don't think that crippling the model by removing all copyrighted data from the dataset is a good long-term solution. You don't keep students from plagiarizing by preventing them from looking at any source related to what they're writing.
DigThatData t1_j6uu82y wrote
This is true, and also generalization and memorization are not mutually exclusive.
EDIT: I can't think of a better way to articulate this, but the image that keeps coming to my mind is a model memorizing the full training data and simulating a nearest neighbors estimate.
pm_me_your_pay_slips OP t1_j6wn43x wrote
That models that memorize better generalize better has been observed in large language models:
https://arxiv.org/pdf/2202.07646.pdf
https://arxiv.org/pdf/2205.10770.pdf
An interesting way to quantify memorization is proposed here, although it will be expensive for a model like SD: https://proceedings.neurips.cc/paper/2021/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf.
Basically: you perform K-fold cross-validation and measure how much more likely an image is when it is included in the training dataset than when it is not. For memorized images, the likelihood when they are not included in the dataset drops to close to zero. Note that they caution against using nearest-neighbour distance to quantify memorization, as it is not correlated with the described memorization score.
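A schematic of that score for a model with a tractable likelihood (the K-fold split and the `train_fn` / `log_likelihood_fn` hooks are placeholders I made up to show the shape of the computation; for a model like SD you'd need a likelihood proxy such as an ELBO, which is exactly what makes it expensive):

```python
import numpy as np

def memorization_scores(dataset, train_fn, log_likelihood_fn, k=5, seed=0):
    """K-fold approximation of the described score:
    log p(x | model trained WITH x) - log p(x | model trained WITHOUT x).
    `train_fn(examples)` returns a fitted model; `log_likelihood_fn(model, x)`
    returns log p(x) under it. High scores flag likely memorization."""
    n = len(dataset)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    model_full = train_fn(dataset)  # trained on everything
    scores = np.empty(n)
    for fold in folds:
        held_out = set(int(i) for i in fold)
        model_held_out = train_fn([x for i, x in enumerate(dataset) if i not in held_out])
        for i in held_out:
            with_x = log_likelihood_fn(model_full, dataset[i])
            without_x = log_likelihood_fn(model_held_out, dataset[i])
            scores[i] = with_x - without_x
    return scores
```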
DigThatData t1_j6xexyf wrote
> That models that memorize better generalize better has been observed in large language models
I think this is an incorrect reading. Increasing model capacity is a reliable strategy for increasing generalization (Kaplan et al. 2020, scaling laws), and larger-capacity models have a higher propensity to memorize (your citations). The correlations discussed in both of those links are with capacity specifically, not with generalization ability broadly. Scaling law research has recently been demonstrating that there is probably a lot of wasted capacity in certain architectures, which suggests that the generalization potential of those models could be achieved with a much lower potential for memorization. See, for example, Tirumala et al. 2022 and Chinchilla (Hoffmann et al. 2022).
which is to say: you're not wrong that a lot of recently trained models that generalize well have also been observed to memorize. but I don't think it's accurate to suggest that the reason these models generalize well is linked to a propensity/ability to memorize. it's possible this is the case, but I don't think anything suggesting this has been demonstrated. it seems more likely that generalization and memorization are correlated through the confounder of capacity, and contemporary research is actively attacking the problem of excess capacity in part to address the memorization question specifically.
EDIT: Also... I have some mixed feelings about that last paper. It's new to me and I just woke up, so I'll have to take another look after I've had some coffee, but although their approach feels intuitively sound from the direction of the LOO methodology, I think their probabilistic formulation of memorization is problematic. They formalize memorization using a definition that appears to me to be indistinguishable from an operational definition of generalizability. Not even OOD generalizability: perfectly reasonable in-distribution generalization to unseen data would, according to these researchers, have the same properties as memorization. That's... not helpful. Anyway, I need to read this closer, but "lower posterior likelihood" seems to me fundamentally different from "memorized". Their approach appears to make no effort to distinguish between a model that has "memorized" a training datum and one that has "learned" meaningful features in the neighborhood of a datum with high [leverage](https://en.wikipedia.org/wiki/Leverage_(statistics\)). Are they detecting memorization or outlier samples? If the "outliers" are valid in-distribution samples, removing them harms the diversity of the dataset, and the model may have significantly less opportunity to learn features in the neighborhood of those observations (i.e. they are high leverage). My understanding is that the problem of memorization is generally more pathological in high-density regions of the data, which would be undetectable by their approach.
pm_me_your_pay_slips OP t1_j6yl0wq wrote
The first paper proposes a way of quantifying memorization by looking at pairs of prefixes and postfixes and observing whether the postfixes were generated by the model when the prefixes were used as prompts.
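For concreteness, that prefix/postfix check looks roughly like the sketch below with an off-the-shelf causal LM (the model choice, decoding settings, and exact-match criterion are arbitrary illustrations, not that paper's setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def regurgitates(prefix: str, true_suffix: str, max_new_tokens: int = 50) -> bool:
    """Greedy-decode a continuation of `prefix` and check whether the model
    reproduces the training suffix verbatim."""
    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return continuation.strip().startswith(true_suffix.strip())
```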
The second paper has this to say about generalization:
> A natural question at this point is to ask why larger models memorize faster? Typically, memorization is associated with overfitting, which offers a potentially simple explanation. In order to disentangle memorization from overfitting, we examine memorization before overfitting occurs, where we define overfitting occurring as the first epoch when the perplexity of the language model on a validation set increases. Surprisingly, we see in Figure 4 that as we increase the number of parameters, memorization before overfitting generally increases, indicating that overfitting by itself cannot completely explain the properties of memorization dynamics as model scale increases.
In fact, this is the title of the paper: "Memorization without overfitting".
> Anyway, need to read this closer, but "lower posterior likelihood" to me seems fundamentally different from "memorized".
The memorization score is not "lower posterior likelihood", but the log density ratio for a sample: log( p(sample | dataset including sample) / p(sample | dataset excluding sample) ). Thus, a high memorization score is given to samples that go from very unlikely when not included to as likely as the average sample when included in the training data, or from as likely as the average training sample when not included in the training data to above-average likelihood when included.
DigThatData t1_j6ynesq wrote
> log( p(sample | dataset including sample) / p(sample | dataset excluding sample) )
which, like I said, is basically identical to statistical leverage. If you haven't seen it before, you can compute LOOCV for a regression model directly from the hat matrix (which is another name for the matrix of leverage values). This isn't a good definition for "memorization" because it's indistinguishable from how we define outliers.
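For anyone who hasn't seen it, here's the OLS version of that shortcut in numpy (a standard textbook identity, sketched from memory):

```python
import numpy as np

def loo_residuals_via_hat(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """For ordinary least squares, the leave-one-out residual for point i is
    e_i / (1 - h_ii), where h_ii is that point's leverage (a diagonal entry
    of the hat matrix). High-leverage points dominate their own fit, which is
    why a "likelihood with vs. without this point" score ends up looking a
    lot like an outlier/leverage measure."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # hat matrix
    e = y - H @ y                           # ordinary residuals
    h = np.diag(H)                          # leverages h_ii
    return e / (1.0 - h)                    # LOO residuals, no refitting required
```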
> What's the definition of memorization here? how do we measure it?
I'd argue that what's at issue here is differentiating between memorization and learning. My concern regarding the density ratio here is that a model that had learned to generalize well in the neighborhood of the observation in question would behave the same way, so this definition of memorization doesn't differentiate between memorization and learning, which I think effectively renders it useless.
I don't love everything about the paper you linked in the OP, but I think they're on the right track by defining their "memorization" measure by probing the model's ability to regenerate presumably memorized data, especially since our main concern wrt memorization is in regards to the model reproducing memorized values.
pm_me_your_pay_slips OP t1_j6ypajq wrote
>This isn't a good definition for "memorization" because it's indistinguishable from how we define outliers.
The paper has this to say about your point
> If highly memorized observations are always given a low probability when they are included in the training data, then it would be straightforward to dismiss them as outliers that the model recognizes as such. However, we find that this is not universally the case for highly memorized observations, and a sizable proportion of them are likely only when they are included in the training data.
> Figure 3a shows the number of highly memorized and “regular” observations for bins of the log probability under the VAE model for CelebA, as well as example observations from both groups for different bins. Moreover, Figure 3b shows the proportion of highly memorized observations in each of the bins of the log probability under the model. While the latter figure shows that observations with low probability are more likely to be memorized, the former shows that a considerable proportion of highly memorized observations are as likely as regular observations when they are included in the training set. Indeed, more than half the highly memorized observations fall within the central 90% of log probability values.
TL;DR: if this method were giving high scores only to outliers, those samples would have low likelihood even when included in the training data (because they are outliers). But the authors observed that a sizeable proportion of the samples with a high memorization score are as likely as regular (inlier) data.
A_fellow t1_j6xq9gj wrote
Pretending stability had or will have any principles other than profit is laughable.
DigThatData t1_j6y35x2 wrote
It's a startup that evolved out of a community of people who found each other through common interests in open source machine learning for the public good (i.e. Eleuther and LAION), committed to providing the public with access to ML tools that were otherwise gated behind corporate paywalls. For several years, that work was all being done by volunteers in their free time. We're barely a year old as an actual company and we're not perfect. But as far as intentions and integrity go: you're talking about a group of people who were essentially already functioning as a volunteer-run non-profit, and were then given the opportunity to continue that work with a salary, benefits, and resources.
If profit was our chief concern, we wouldn't be giving these models away for free. Simple as that. There are plenty of valid criticisms you could lob our way, but a lack of principles and greed aren't among them. You might not like the way we do things or certain choices we've made, but if you think the intentions behind those decisions are primarily profit-motivated, you should really learn more about the people you're criticizing, because you couldn't be more misinformed.