pm_me_your_pay_slips OP t1_j6ypajq wrote on February 2, 2023 at 8:50 PM

Reply to comment by DigThatData in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips

>This isn't a good definition for "memorization" because it's indistinguishable from how we define outliers.

The paper has this to say about your point:

> If highly memorized observations are always given a low probability when they are included in the training data, then it would be straightforward to dismiss them as outliers that the model recognizes as such. However, we find that this is not universally the case for highly memorized observations, and a sizable proportion of them are likely only when they are included in the training data.

> Figure 3a shows the number of highly memorized and “regular” observations for bins of the log probability under the VAE model for CelebA, as well as example observations from both groups for different bins. Moreover, Figure 3b shows the proportion of highly memorized observations in each of the bins of the log probability under the model. While the latter figure shows that observations with low probability are more likely to be memorized, the former shows that a considerable proportion of highly memorized observations are as likely as regular observations when they are included in the training set. Indeed, more than half the highly memorized observations fall within the central 90% of log probability values.

TLDR: if this method were giving high scores only to outliers, then those samples would have low likelihood even when they were included in the training data (because they are outliers). But the authors observed that a sizeable proportion of the samples with a high memorization score are as likely as regular (inlier) data.
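To make the check concrete, here is a rough sketch of the "central 90%" comparison from the quoted passage. It is not the paper's code: the function names, the toy log probabilities, and the choice of which samples count as "highly memorized" are all made up for illustration. The only idea taken from the quote is to ask what fraction of highly memorized samples fall inside the central 90% of log-probability values.

```python
def central_interval(values, coverage=0.90):
    """Return (low, high) bounds of the central `coverage` share of `values`.

    Uses simple order statistics; a hypothetical helper, not the paper's method.
    """
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lo_idx = int(tail * (n - 1))
    hi_idx = int(round((1.0 - tail) * (n - 1)))
    return ordered[lo_idx], ordered[hi_idx]


def fraction_memorized_in_center(log_probs, is_memorized, coverage=0.90):
    """Fraction of highly memorized samples whose log probability lies
    inside the central `coverage` interval of all log probabilities."""
    low, high = central_interval(log_probs, coverage)
    memorized = [lp for lp, m in zip(log_probs, is_memorized) if m]
    if not memorized:
        return 0.0
    inside = sum(1 for lp in memorized if low <= lp <= high)
    return inside / len(memorized)


# Toy data (assumption): 100 samples with log probabilities -100, -99, ..., -1,
# measured when each sample IS part of the training set.
log_probs = [float(-100 + i) for i in range(100)]

# Toy labels (assumption): three memorized samples sit in the low-probability
# tail (outlier-like), four sit in the bulk of the distribution (inlier-like).
memorized_indices = {0, 1, 2, 40, 50, 60, 70}
is_memorized = [i in memorized_indices for i in range(100)]

fraction = fraction_memorized_in_center(log_probs, is_memorized)
print(f"memorized samples inside the central 90%: {fraction:.2f}")
```

If memorization were just another name for outlier detection, this fraction would be close to zero. In the toy data it is above one half, which is the kind of pattern the authors report for CelebA.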