Submitted by ZeronixSama t3_yieq8c in MachineLearning
Red-Portal t1_iuipmo7 wrote
There are practical rather than theoretical differences. In the diffusion setting you don't have access to the log-likelihood itself, only its gradient (the score). So you can't use fancy MCMC tricks like Metropolis-Hastings corrections, which need log-likelihood differences, and you're stuck with variants of unadjusted Langevin dynamics.
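For concreteness, here's a minimal sketch of that kind of sampler, assuming all you have is a `score` function (i.e. grad log p); the names are illustrative:

```python
import numpy as np

def unadjusted_langevin(score, x0, step_size=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: sample using only grad(log p),
    never evaluating log p itself, so no Metropolis accept/reject step."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        # x <- x + (eps/2) * score(x) + sqrt(eps) * z,  z ~ N(0, I)
        x = x + 0.5 * step_size * score(x) + np.sqrt(step_size) * z
    return x

# Example: standard Gaussian target, whose score is exactly -x.
sample = unadjusted_langevin(score=lambda x: -x, x0=np.zeros(2))
```

Note there's no accept/reject anywhere: a Metropolis correction would require evaluating log-likelihood differences, which is exactly what we don't have.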
ZeronixSama OP t1_iuirypd wrote
Thanks, that makes sense. I'm still a bit confused, though: is there something about the diffusion method that precludes accessing the log-likelihood, or is it just that in typical generative settings the log-likelihood is intractable to model?
Red-Portal t1_iuise0p wrote
The whole point of diffusion models (or, really, score modelling) is to get around having to learn the log-likelihood. So having access to the likelihood kind of defeats the point.
ZeronixSama OP t1_iuitqgd wrote
Ok, I think this blog post helped me understand: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Essentially the idea is that distributions with tractable log-likelihoods are usually not flexible enough to capture the rich structure of real datasets, and vice versa. So explicitly modelling the log-likelihood for such datasets is a doomed endeavour, but modelling the gradient of the log-likelihood is both tractable and flexible enough to be practically useful.
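For a concrete picture of how that gradient gets learned without ever touching the likelihood, here's a minimal sketch of the denoising score matching objective (Vincent, 2011) in PyTorch; `score_model` is a stand-in for whatever network you'd train, and the names are illustrative:

```python
import torch

def dsm_loss(score_model, x, sigma=0.1):
    """Denoising score matching: fit the score of a Gaussian-smoothed
    data distribution using only samples, never evaluating log p.
    For x_noisy = x + sigma * z with z ~ N(0, I), the regression
    target for the score at x_noisy is -z / sigma."""
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    target = -noise / sigma              # = -(x_noisy - x) / sigma**2
    pred = score_model(x_noisy)          # network's score estimate
    return ((pred - target) ** 2).sum(dim=-1).mean()
```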
P.S. That does make me wonder if it's turtles all the way down... In a sense, distributions whose grad(log-likelihood) can be tractably modelled could also be argued to be less flexible than distributions outside this class, so in the future there may be some second-order diffusion method that operates on grad(grad(log-likelihood)), i.e. the Hessian of the log-likelihood, instead. The downside is the huge compute required for second derivatives, but the upside could be much more flexible modelling capability.
Red-Portal t1_iuiv69n wrote
The intuition is actually simpler, in my opinion. Modelling the likelihood in a non-parametric fashion is basically density estimation, and density estimation in dimensions higher than two is a notoriously difficult problem, especially because you need to accurately estimate absolute probability density values over the whole space, and those have to be globally consistent (they must integrate to one). In contrast, the score only cares about relative density values, which are local properties, so that's an easier problem. The catch is that you now need your local information to cover enough of the space, which is done through annealing/tempering.
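A rough sketch of what that annealing looks like in practice; the step-size schedule follows Song & Ermon (2019), and `score(x, sigma)` is assumed to return the score of the sigma-smoothed density:

```python
import numpy as np

def annealed_langevin(score, x0, sigmas=(1.0, 0.5, 0.1, 0.01),
                      eps=2e-3, steps_per_level=100, rng=None):
    """Annealed Langevin dynamics: run Langevin steps at a decreasing
    sequence of noise levels, so locally learned scores first guide the
    sampler over broad, smoothed densities, then over the sharp target."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for sigma in sigmas:                        # high noise -> low noise
        step = eps * (sigma / sigmas[-1]) ** 2  # Song & Ermon step schedule
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * z
    return x

# Example: N(0, 1) target smoothed with N(0, sigma^2) noise is
# N(0, 1 + sigma^2), whose score is -x / (1 + sigma^2).
sample = annealed_langevin(lambda x, s: -x / (1 + s**2), x0=np.zeros(2))
```

The high-noise levels smear probability mass over the whole space, so the sampler can find the right regions; the low-noise levels then refine it, which is how purely local score information ends up covering the space.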