Submitted by ZeronixSama t3_yieq8c in MachineLearning

I was recently reviewing the diffusion methods used in Stable Diffusion and my mind wandered to Markov Chain Monte Carlo, which got me thinking - are there important theoretical similarities / differences between these methods?

My own stream-of-consciousness thoughts:

  1. Function: Both diffusion and MH are sampling-based generative methods that produce samples from a target distribution.
  2. Iterative sampling: Diffusion works by predicting a de-noising term, which is progressively added to a random noise sample. This is similar to the MH proposal function g(x' | x), which generates a candidate next state in the Markov chain. For diffusion, the process converges to the training distribution, whereas MCMC converges to the stationary distribution.
  3. Biasing: Diffusion can be conditioned on exact boundary conditions, or 'guided' towards certain types of outputs by modifying the diffusion gradient in the vein of Janner et al. In the same way, the MH algorithm can respect hard and soft constraints by configuring the acceptance ratio f(x') / f(x) to encode the desired properties of the stationary distribution.
  4. Overall, we may say that in the literature, diffusion is largely 'learned' whereas MCMC is 'designed', but reversing that scenario may yield interesting results (in terms of learnable MCMC or designed diffusion).
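To make points 2 and 3 concrete, here is a minimal random-walk Metropolis-Hastings sampler. This is a generic textbook sketch rather than anything specific to diffusion; the 1D Gaussian target, proposal width, and chain length are arbitrary choices for illustration:

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_steps=5000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting an unnormalized log density log_f."""
    rng = np.random.default_rng(rng)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()  # proposal g(x' | x)
        log_ratio = log_f(x_prop) - log_f(x)       # log of f(x') / f(x)
        if np.log(rng.uniform()) < log_ratio:      # accept/reject step
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Target: standard normal, known only up to a normalizing constant
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, rng=0)
```

Note that log_f only needs to be known up to an additive constant, which is exactly the "configure the acceptance ratio" freedom mentioned in point 3.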

This thought raises a lot of important questions for me.

  • Can we interpret diffusion as some variant of MCMC and therefore derive theoretical properties of it?
  • The basic discussion above analyses MH and diffusion in terms of two properties: their iterative sampling and biasing procedures. Can we 'mix-and-match' to get new algorithms which might be better?
  • What are the important theoretical differences between these two methods?

It's not clear to me what the answers are; I'm hoping to have a good discussion with the smart minds in this forum!

71

Comments


HateRedditCantQuitit t1_iuif6gz wrote

Take a look at the connections between diffusion and score models. From that point of view, diffusion is annealed MCMC. The score-model side of the literature seems to be deep in the MCMC side of things. A starting point would be Yang Song’s blog post on score models. https://yang-song.net/blog/2021/score/
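For a concrete picture of the "diffusion is annealed MCMC" connection, here is a sketch of annealed Langevin dynamics in the style of the NCSN sampler described in that blog post. The analytic score function below is a stand-in for a learned model (using Gaussian data so the perturbed score is exact), and the noise schedule and step-size constants are illustrative guesses, not tuned values:

```python
import numpy as np

def annealed_langevin(score, x, sigmas, steps_per_level=100, eps=0.01, rng=None):
    """Annealed Langevin dynamics: Langevin updates at decreasing noise levels."""
    rng = np.random.default_rng(rng)
    for sigma in sigmas:                        # anneal from large to small noise
        alpha = eps * sigma**2 / sigmas[-1]**2  # step size shrinks with noise level
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

# Stand-in for a learned score model: if the data is N(0, 1), the sigma-perturbed
# density is N(0, 1 + sigma^2), whose exact score is -x / (1 + sigma^2)
score = lambda x, sigma: -x / (1.0 + sigma**2)
x0 = np.random.default_rng(0).standard_normal(2000) * 10.0  # far-off initialization
samples = annealed_langevin(score, x0, sigmas=[5.0, 2.0, 1.0, 0.5, 0.1], rng=0)
```

The large-sigma levels let badly initialized samples feel a gradient almost everywhere; the small-sigma levels then refine them towards the data distribution.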

29

ZeronixSama OP t1_iuinl5o wrote

That blog post is amazing! Exactly what I was looking for, thanks very much. At a high level, it seems that MCMC sampling can, in fact, be used to improve diffusion models' generative capabilities

7

ccrdallas t1_iuimqp9 wrote

Diffusion models essentially use a stochastic differential equation to produce samples. MCMC methods that produce samples using such equations, often known as Langevin diffusions, are called the Unadjusted and Metropolis-Adjusted Langevin Algorithms (ULA/MALA).

Typically we don’t use the Metropolis adjustment in practice, so these methods aren’t truly Metropolis-Hastings algorithms; they are just discretized SDEs. The purpose of the adjustment is to reduce bias and ensure convergence to a particular distribution, but these are lesser concerns in diffusion models, particularly with a small enough step size.
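The distinction fits in a few lines: the same Langevin proposal becomes ULA when accepted unconditionally and MALA when passed through a Metropolis accept/reject test. A generic 1D sketch (step size and target are illustrative choices):

```python
import numpy as np

def langevin_step(x, grad_log_p, log_p=None, step=0.1, rng=None):
    """One Langevin update; applies the Metropolis correction only if log_p is given."""
    rng = np.random.default_rng(rng)  # pass a shared Generator across calls in real use
    noise = rng.standard_normal(np.shape(x))
    x_prop = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise
    if log_p is None:
        return x_prop                 # ULA: always accept (biased by discretization)
    # MALA: accept/reject test corrects the discretization bias
    def log_q(a, b):                  # log density of proposing a when at b
        return -np.sum((a - b - step * grad_log_p(b))**2) / (4 * step)
    log_alpha = log_p(x_prop) - log_p(x) + log_q(x, x_prop) - log_q(x_prop, x)
    return x_prop if np.log(rng.uniform()) < log_alpha else x

# MALA chain targeting a standard normal
rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2
grad = lambda x: -x
x, xs = 0.0, []
for _ in range(3000):
    x = langevin_step(x, grad, log_p=log_p, step=0.5, rng=rng)
    xs.append(float(x))
```

Dropping the `log_p` argument turns the same chain into ULA, matching the comment above: the proposal is identical, only the correction is missing.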

11

Red-Portal t1_iuipcuj wrote

>are lesser concerns in diffusion models

Lesser concerns because we don't care about the bias that much as long as the samples look okay. But however small the step size, the bias will be there. In fact, unadjusted Langevin is pretty terrible for statistical applications in terms of bias, even with small step sizes.

9

ZeronixSama OP t1_iuiqs8u wrote

I see, so in other words: diffusion models skip the accept/reject step that the Metropolis-Hastings algorithm would usually perform.

5

Red-Portal t1_iuipmo7 wrote

There are practical rather than theoretical differences. In the diffusion setting, you don't have access to the "log likelihood", only its derivatives. So you can't really use fancy MCMC tricks, and you're stuck with variants of unadjusted Langevin.

5

ZeronixSama OP t1_iuirypd wrote

Thanks, that makes sense. I'm still a bit confused - is there something about the diffusion method that precludes accessing the log likelihood, or is it just that in typical generative settings the log likelihood is intractable to model?

1

Red-Portal t1_iuise0p wrote

The whole point of diffusion models (or really, score modelling) is to get around having to learn the log likelihood. So having access to the likelihood kind of defeats the point.

10

ZeronixSama OP t1_iuitqgd wrote

Ok, I think this blog post helped me understand: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

Essentially the idea is that distributions with tractable log likelihoods are usually not flexible enough to capture the rich structure of real datasets, and vice versa. So explicitly trying to model the log likelihood for such datasets is a doomed endeavour, but modelling the gradient of the log likelihood is both tractable and flexible 'enough' to be practically useful.

P.S. That does make me wonder if it's turtles all the way down... In a sense, distributions whose grad(log-likelihood) can be tractably modelled could also be argued to be less flexible than distributions outside this class, so in the future there may be some second-order diffusion method that operates on grad(grad(log-likelihood)) instead. The downside is the huge compute required for second derivatives, but the upside could be much more flexible modelling capability.

4

Red-Portal t1_iuiv69n wrote

The intuition is actually simpler, in my opinion. Modelling the likelihood in a non-parametric fashion is basically density estimation, and density estimation in more than a couple of dimensions is a problem well known to be difficult, especially since you need to accurately estimate absolute probability density values over the whole space, and these need to be globally consistent. In contrast, the score only cares about relative density values, which are local properties, so that's an easier problem. However, you now need your local information to cover enough of the space, which is done through annealing/tempering.
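A toy 1D illustration of the local-vs-global point: the score never sees the normalizing constant, so scaling the density by any constant leaves it unchanged, and it can be checked against finite differences without ever normalizing. The quartic density here is an arbitrary choice:

```python
import numpy as np

# Unnormalized density: the normalizer Z is unknown (and hard in high dimensions)
log_f = lambda x: -0.25 * x**4      # log p(x) = log_f(x) - log Z
score = lambda x: -x**3             # d/dx log p(x): the log Z term differentiates away

# Scaling the density by a constant (here 7) shifts log_f but not its gradient
log_f_scaled = lambda x: log_f(x) + np.log(7.0)

xs = np.linspace(-2.0, 2.0, 9)
finite_diff = lambda g, x, h=1e-5: (g(x + h) - g(x - h)) / (2 * h)

assert np.allclose(finite_diff(log_f, xs), finite_diff(log_f_scaled, xs))
assert np.allclose(finite_diff(log_f, xs), score(xs), atol=1e-4)
```

Estimating the absolute density would require knowing Z, a global quantity; the score is recoverable from purely local comparisons.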

7

maizeq t1_iujf76j wrote

The sampling method used with diffusion/score models is in fact a type of approximate MCMC. As another commenter mentioned, it’s the result of discretising (hence approximate) an SDE whose equilibrium distribution is the data distribution under the model.

The advantage of Langevin sampling methods over a method like Metropolis-Hastings is better efficiency (lower mixing time), because the gradient information reduces random-walk behaviour. It also scales better with higher dimensionality.

What made modern diffusion/score-based models successful was combining this with a schedule of additive noise and conditioning the score model on the scale of that noise (the time step). This solved various problems with the traditional score-matching objective (like poor performance in low-density regions).
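That noise-conditioned training objective can be sketched as denoising score matching averaged over a noise schedule. The `score_model(x, sigma)` interface and the sigma-squared weighting follow Song & Ermon's NCSN formulation; the "oracle" below stands in for a trained network by using Gaussian data, where the exact perturbed score is known in closed form:

```python
import numpy as np

def dsm_loss(score_model, x, sigmas, rng=None):
    """Denoising score matching averaged over noise levels, with sigma^2 weighting."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for sigma in sigmas:
        z = rng.standard_normal(x.shape)
        x_noisy = x + sigma * z                  # perturb the data at this scale
        target = -z / sigma                      # score of the perturbation kernel
        err = score_model(x_noisy, sigma) - target
        total += sigma**2 * np.mean(err**2)      # lambda(sigma) = sigma^2 weighting
    return total / len(sigmas)

# Sanity check with data ~ N(0, 1): the sigma-perturbed density is N(0, 1 + sigma^2),
# whose exact score is -x / (1 + sigma^2)
oracle = lambda x, sigma: -x / (1.0 + sigma**2)
x = np.random.default_rng(0).standard_normal(10000)
loss = dsm_loss(oracle, x, sigmas=[1.0, 0.5, 0.1], rng=1)
```

Even the exact score does not drive this loss to zero (the conditional variance of the target remains), but it does score lower than any wrong model, which is what makes the objective trainable.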

1

Red-Portal t1_iujl4c3 wrote

>It also scales better with higher dimensionality.

I believe this is not true. I am not aware of results showing that ULA is faster than MALA; in fact, I believe it's the complete opposite. This paper shows that MALA is much faster than ULA. Did the state of knowledge change recently?

1