Submitted by ZeronixSama t3_yieq8c in MachineLearning
ZeronixSama OP t1_iuitqgd wrote
Reply to comment by Red-Portal in [D] Diffusion vs MCMC as sampling algorithms by ZeronixSama
Ok, I think this blog post helped me understand: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Essentially the idea is that models with tractable log-likelihoods are usually not flexible enough to capture the rich structure of real datasets, and flexible models usually don't have tractable log-likelihoods. So explicitly modelling the log-likelihood for such datasets is a doomed endeavour, but modelling the gradient of the log-likelihood (the score) is both tractable and flexible 'enough' to be practically useful.
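To make that concrete, here's a rough sketch (assuming PyTorch, with a made-up toy 2D dataset and a single noise level; not code from the blog post) of training a score network with denoising score matching and then sampling with Langevin dynamics. At no point does a normalized density appear:

```python
# A minimal sketch of "model the score, not the density": denoising score
# matching on a toy 2D dataset at a single noise level sigma, then unadjusted
# Langevin sampling. All names are illustrative; real diffusion models use a
# whole schedule of noise levels.
import torch
import torch.nn as nn

def sample_data(n):
    # Toy "dataset": a 2D mixture of two Gaussians.
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    idx = torch.randint(0, 2, (n,))
    return centers[idx] + 0.3 * torch.randn(n, 2)

# s_theta(x) approximates grad_x log p_sigma(x); note there is no
# normalizing constant anywhere -- we only ever predict a vector field.
score_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
sigma = 0.5

for step in range(2000):
    x = sample_data(256)
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    # Denoising score matching target:
    #   grad log N(x_noisy; x, sigma^2 I) = -(x_noisy - x) / sigma^2 = -noise / sigma
    target = -noise / sigma
    loss = ((score_net(x_noisy) - target) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Unadjusted Langevin dynamics: follow the learned score plus injected noise.
x = 3.0 * torch.randn(512, 2)
step_size = 0.01
for _ in range(500):
    with torch.no_grad():
        x = x + step_size * score_net(x) + (2 * step_size) ** 0.5 * torch.randn_like(x)
print(x.mean(dim=0), x.std(dim=0))
```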
P.S. That does make me wonder if it's turtles all the way down... In a sense, distributions whose grad(log-likelihood) can be tractably modelled could also be argued to be less flexible than distributions that don't fall within this class, so in the future there may be some second-order diffusion method that operates on grad(grad(log-likelihood)) instead. The downside is the huge compute required for the second derivative, but the upside could be much more flexible modelling capability.
Red-Portal t1_iuiv69n wrote
The intuition is actually simpler in my opinion. Modeling the likelihood in a non-parametric fashion is basically density estimation, and density estimation in more than a couple of dimensions is well known to be difficult, especially since you need to accurately estimate the absolute probability density values over the whole space, and those values need to be globally consistent. In contrast, the score only cares about relative density values, which are local properties, so that's an easier problem. However, you now need your local information to cover enough of the space, which is done through annealing/tempering.
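A quick way to see the "relative/local" part (this is just an illustrative PyTorch check on a 1D Gaussian, not anything from the thread): the normalizing constant drops out of the gradient, so the score of the unnormalized density is already the score of the true density:

```python
# Numerical check that the score never sees the normalizing constant:
# log p(x) = log p~(x) - log Z, and log Z is constant in x, so both
# log-densities have exactly the same gradient.
import math
import torch

log_Z = 0.5 * math.log(2 * math.pi)       # normalizer of a standard Gaussian

def log_p_unnormalized(x):
    return -0.5 * x ** 2                   # log p~(x)

def log_p_normalized(x):
    return log_p_unnormalized(x) - log_Z   # log p(x)

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)
score_unnorm = torch.autograd.grad(log_p_unnormalized(x).sum(), x)[0]
score_norm = torch.autograd.grad(log_p_normalized(x).sum(), x)[0]
print(torch.allclose(score_unnorm, score_norm))   # True
```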