Submitted by WallabyDue2778 t3_y92tln in MachineLearning

I’m reading up on diffusion models, and these seem to be the two dominant approaches. They are also said to be equivalent under certain parameterizations of the formulations. However, the more recent papers, for example Stable Diffusion, seem to use the DDPM-type formulation more often, and by this I mean they learn the noise rather than the score.

Is this observation true? And if it is, what are some reasons? I’ve never implemented a model like this myself, so I don’t know how difficult or practical they are. Perhaps all the issues listed in the score matching papers (the manifold hypothesis, low data density regions, inaccurate score estimation) make them really difficult to work with, or is there something more fundamental?

Thanks in advance!

7

Comments

dasayan05 t1_it46pby wrote

>... these seem to be two dominant approaches ...

Totally. There are two streams of ideas, similar but not exactly equivalent, namely Score-Based Models (SBM) and Denoising Diffusion Probabilistic Models (DDPM). There has been an effort to unify the two under the umbrella of Stochastic Differential Equations (SDE), where SBM -> "Variance Exploding SDE" and DDPM -> "Variance Preserving SDE". By far, DDPM is more popular -- the reason being that DDPM has stronger theoretical guarantees and fewer hyperparameters. SBMs are, in some parts, intuitive and observation-based.

>... they learn the noise rather than the score ...

Yes. SBM uses the "score" while DDPM uses "noise-estimates"; but they are related -- "score = -eps / noise_std" (see the CVPR'22 diffusion tutorial slides, slide 57). IMO, the major difference between SBM and DDPM is their forward noising process -- SBM only adds noise, while DDPM adds noise and also attenuates the signal, and this process is systematically "tied" to the noise schedule \beta_t. This makes the reverse process look slightly different.
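As a minimal sketch of that relation (my own PyTorch rendering; `eps_model` and the precomputed cumulative schedule `alpha_bar` are placeholder names, not from any particular codebase):

```python
import torch

def score_from_eps(eps_model, x_t, t, alpha_bar):
    """Convert a DDPM noise estimate into a score estimate.

    Uses the relation score = -eps / noise_std, where the noise std at
    step t is sqrt(1 - alpha_bar_t) under the DDPM forward process
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    eps = eps_model(x_t, t)                  # predicted noise, same shape as x_t
    sigma_t = (1.0 - alpha_bar[t]).sqrt()    # noise std at step t (t an int here)
    return -eps / sigma_t                    # estimate of grad_x log p_t(x_t)
```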

If you want to implement Diffusion Models, start with DDPM as formulated by Ho et al. I have never seen an algorithm written so clearly as the one in Ho et al's Algorithm 1 & 2. It can't get any simpler in terms of implementation.
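To give a flavour of how short Algorithm 1 is, here is a minimal PyTorch sketch of one training step (again, `eps_model` and `alpha_bar` are my own placeholder names):

```python
import torch

def ddpm_training_step(eps_model, x0, alpha_bar, T):
    """One training step of Ho et al.'s Algorithm 1.

    Sample a timestep and noise, form the noisy input in closed form,
    and regress the model's output onto the noise that was added.
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # t ~ Uniform({0, ..., T-1})
    eps = torch.randn_like(x0)                               # eps ~ N(0, I)
    a_bar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))    # broadcast to x0's shape
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps     # noisy sample at step t
    return ((eps_model(x_t, t) - eps) ** 2).mean()           # simple (unweighted) MSE loss
```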

9

WallabyDue2778 OP t1_it4zvey wrote

Thank you for your reply.

I totally agree that DDPM would be simpler to implement (I’ve never done this, but it seemed more straightforward). But my impression was that score matching is more theoretically grounded than DDPM. The derivation of the score matching objective -- from the gradient-ascent-like Langevin dynamics, to the various approximations of the “target term” in the norm (like grad_x log q(x_tilde|x)) -- feels more sound to me than DDPM, which felt more like: let’s arbitrarily add noise and attenuate, assume the reverse process is also Gaussian, use a model to learn it, and, since we found empirically that learning the noise works better than learning the mean, let’s just do that.

(I don’t mean to belittle the authors’ work. I have never been able to derive and conduct such research)

I do admit that my impression may be due to the fact that I didn’t understand the derivation of those approximations in score matching. And it’s highly likely I don’t know what I’m talking about regarding DDPM.

Would you please give an example of where SBM is intuitive and observation-based? I think the first paper, where they discussed a bunch of pitfalls and then came up with using various noise levels and the noise-conditioned model, seems that way.

3

Red-Portal t1_it516cq wrote

I actually think it's the opposite. Although the "learning the noise" part is voodoo, the probabilistic model itself is quite sound, if you're even slightly Bayesian. What DDPM does is: assuming the transition is Gaussian, let's find that Gaussian. There's nothing inherently wrong with this, since you're conditioning on the assumption. I have a problem with the "let's learn the noise like psychopaths" part too, but I think it has something to do with the scaling of the variational objective. Score matching, on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.

3

UncleVesem1r t1_it5bxzm wrote

Thank you for the reply. It was very helpful.

>Score matching on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.

Sorry if I'm being dense. Could you expand on this? Or could you be more explicit about which part of DDPM provides such a theoretical guarantee while SBM fails to do so, perhaps with equation numbers from the papers? I'm fairly new to this, and it's hard for me to parse all the equations and tell which parts are fluff and which are the real meat. Thank you very much!

1

Red-Portal t1_it5ciyb wrote

DDPM doesn't aim to produce anything related to Langevin sampling. However, its objective function is equivalent to the KL divergence between the "true" Gaussian and the neural-network-parameterized Gaussian. Thus, as long as SGD actually optimizes the DDPM objective, you'll get something close to the true Gaussian in KL divergence. The problem is that learning the noise with an MSE kinda ruins all of this...
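Concretely (writing out the standard derivation, with notation as in Ho et al.), the per-step term of the variational bound is a KL between two Gaussians, which reduces to a weighted noise-matching MSE:

L_{t-1} = E_q [ KL( q(x_{t-1} | x_t, x_0) || p_\theta(x_{t-1} | x_t) ) ]
        = E_{x_0, \epsilon} [ (\beta_t^2 / (2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t))) || \epsilon - \epsilon_\theta(x_t, t) ||^2 ] + const

Dropping the weight gives the "simple" eps-MSE objective actually trained in practice, which is the rescaling being complained about here.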

1

UncleVesem1r t1_it5rffe wrote

I see! I understand why DDPM is good now. I should go back to the paper and pay more attention to the KL divergence part of it.

If I could borrow a few more minutes of your time, could you explain more about what's not as good about score matching?

So, to be explicit: my understanding of Langevin sampling is that, if there's a model that can accurately approximate the score function, one should be able to recover the true data distribution. If this is true, then I guess the criticism of SM is about its objective function, i.e., there's no guarantee that it leads to an accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to solve the objective function involving grad_x log p(x)?

Or perhaps Langevin sampling is the problem. The paper does say that with small enough noise and enough steps, we would end up with an exact sample from the data distribution. Yet if we don't have small enough noise and enough steps, perhaps we end up somewhere that isn't guaranteed to be the true data distribution?

I really appreciate this! Thanks again.

1

Red-Portal t1_it5v31k wrote

>there's no guarantee that it leads to accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to solve the objective function involving grad_x log p(x)?

Oh no, it's not. All it's doing is minimizing the mean-squared error against the score function. Minimizing this objective does not mean that sampling with the learned score function will be a good idea; and in practice it isn't. This is exactly why score modelling has to rely on adding noise. And by doing this, they converged to DDPM.
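For reference, the denoising score matching objective under discussion (in the notation of Vincent 2011 / Song & Ermon 2019) is

J(\theta) = E_{x ~ p_data} E_{x_tilde ~ q_\sigma(x_tilde | x)} [ || s_\theta(x_tilde) - \nabla_{x_tilde} \log q_\sigma(x_tilde | x) ||^2 ]

It's a pure regression loss on the score; nothing in it says the error will be small in the regions a Langevin chain actually visits.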

1

dasayan05 t1_it6u4ho wrote

To clarify, "score matching" itself is quite theoretically grounded -- what is not is the coupling: score matching and Langevin dynamics are not theoretically "tied" together. Langevin dynamics is chosen more as an intuitive way of "using" the score estimates. Moreover, Langevin dynamics theoretically takes infinite time to reach the true distribution, and its convergence depends on a proper choice of `\delta`, a tiny number that acts as a step size:

x_{t-1} = x_t + (\delta / 2) s(x_t, t) + \sqrt{\delta} z,   z ~ N(0, I)
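A minimal sketch of that sampler (my own PyTorch rendering; `score_model`, `delta`, and `n_steps` are all free choices here, which is exactly the issue):

```python
import torch

@torch.no_grad()
def langevin_sample(score_model, x, delta, n_steps):
    """Plain Langevin sampling with a fixed step size delta.

    Both delta and n_steps are hyperparameters the theory does not pin
    down: convergence only holds as delta -> 0 and n_steps -> infinity.
    """
    for t in reversed(range(n_steps)):
        z = torch.randn_like(x)                # fresh Gaussian noise each step
        x = x + (delta / 2) * score_model(x, t) + delta ** 0.5 * z
    return x
```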

Now, look at DDPM. DDPM's training objective is totally "coupled" with its sampling process -- it all comes from very standard calculations on the underlying PGM (probabilistic graphical model). Notice that DDPM's reverse process does not involve a hyperparameter like `\delta`; everything is tied to the known \beta schedule, which tells you exactly what step size to take in order to converge in finitely many (T) steps. DDPM's reverse process is not Langevin dynamics -- it just looks like it, but has a stronger guarantee of convergence.
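For contrast, here is a sketch of one reverse step of Ho et al.'s Algorithm 2 (same caveat about placeholder names; note that every coefficient below is derived from the \beta schedule rather than chosen by hand):

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(eps_model, x_t, t, beta, alpha_bar):
    """One step of Ho et al.'s Algorithm 2 (ancestral sampling).

    All coefficients are computed from the fixed beta schedule;
    there is no free step-size hyperparameter.
    """
    alpha_t = 1.0 - beta[t]
    mean = (x_t - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_model(x_t, t)) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no noise on the final step
    z = torch.randn_like(x_t)
    return mean + beta[t].sqrt() * z           # using the sigma_t^2 = beta_t variant
```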

This makes it more robust compared to score-based Langevin dynamics.

1

UncleVesem1r t1_it801hk wrote

Thank you! My intuition was that score matching + Langevin doesn’t have a forward diffusion process, which probably contributes to why there has to be a step size (right?), and I agree that LD seems like an easy way to use the scores.

How about the SDE formulation of score matching? They also claimed that DDPM is a discretization of a variance-preserving SDE. As far as I can tell, the reverse SDE has a closed form in terms of the forward SDE and doesn’t require extra hyperparameters.

1

dasayan05 t1_it95xq7 wrote

IMO, the forward diffusion process isn't really a "process" -- it need not be sequential; it's parallelizable. The sole purpose of the forward process is to simulate noisy data from a set of "noisy data distributions" crafted with a known set of noise scales -- that's it. SBM and DDPM both have this. For SBMs, choosing the correct largest scale, so that it overpowers the data variance and reaches an uninformative prior, is again a heuristic hyperparameter. For DDPM, it always reaches the prior due to the way the noise scales and attenuation coefficients are computed from \beta_t.
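To make "parallelizable" concrete, here is a sketch under the standard DDPM marginal q(x_t | x_0) = N(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) (placeholder names again), drawing every x_t directly from x_0 with no sequential simulation:

```python
import torch

def forward_noise_all_t(x0, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) for every timestep t at once.

    The closed-form marginal means no step-by-step chain is needed;
    the result stacks one noisy version of x0 per timestep.
    """
    T = alpha_bar.shape[0]
    eps = torch.randn(T, *x0.shape)                          # independent noise per t
    a_bar = alpha_bar.view(T, *([1] * x0.dim()))             # broadcast over x0's shape
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps    # shape (T, *x0.shape)
```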

Agree with your second part. The SDE formulation is good -- it basically puts SBMs on a much stronger theoretical footing. SDEs offer a reverse process which is analytic and in which the score naturally appears -- i.e., again, not many hyperparameters.
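For completeness, this is the reverse-time result used in Song et al.'s framework (it goes back to Anderson, 1982): given a forward SDE

dx = f(x, t) dt + g(t) dw

the reverse-time process is

dx = [ f(x, t) - g(t)^2 \nabla_x \log p_t(x) ] dt + g(t) d\bar{w}

where the only learned quantity is the score \nabla_x \log p_t(x); everything else is fixed by the forward SDE.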

1