Thanks, that makes sense. I'm still a bit confused - is there something about the diffusion method that precludes accessing the log likelihood, or is it just that in typical generative settings the log likelihood is intractable to model?
That blog post is amazing! Exactly what I was looking for, thanks very much. At a high level, it seems that MCMC sampling can, in fact, be used to improve diffusion models' generative capabilities
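For anyone who wants to see the connection concretely, here is a minimal sketch of an MCMC sampler that only needs the gradient of the log density: unadjusted Langevin dynamics. Score-based samplers sometimes use steps like this as a "corrector". Everything here is an assumption made for illustration: the target is a 1D Gaussian whose score is known in closed form (standing in for a learned score network), and the names `score`, `langevin_sample`, `MU` and `SIGMA` are hypothetical.

```python
import numpy as np

# Toy target (assumption): a 1D Gaussian N(MU, SIGMA^2).
MU, SIGMA = 2.0, 0.5

def score(x):
    # Gradient of the log density with respect to x.
    # In a diffusion model this would be a learned network instead.
    return -(x - MU) / SIGMA**2

def langevin_sample(n_chains=5000, n_steps=1000, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Start every chain far from the target, at N(0, 1).
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        # Unadjusted Langevin update: drift along the score plus Gaussian noise.
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

samples = langevin_sample()
sample_mean = float(samples.mean())
sample_std = float(samples.std())
print(sample_mean, sample_std)
```

The sampler never evaluates the density itself, only its score. With a finite step size the chain is slightly biased, so the sample standard deviation lands close to, but not exactly at, `SIGMA`.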
ZeronixSama OP t1_iuitqgd wrote
Reply to comment by Red-Portal in [D] Diffusion vs MCMC as sampling algorithms by ZeronixSama
Ok, I think this blog post helped me understand: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Essentially, the idea is that tractable log-likelihoods are usually not flexible enough to capture the rich structure in real datasets, and flexible ones are usually not tractable. So explicitly modelling the log-likelihood for such datasets is a doomed endeavour, but modelling the gradient of the log-likelihood with respect to the data (the score) is both tractable and flexible 'enough' to be practically useful.
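One way to see why the gradient is easier to work with: the normalizing constant is the hard part of a log-likelihood, and it vanishes when you differentiate with respect to the data. Below is a rough numerical check on a toy 1D example. The double-well density and all function names are hypothetical choices of mine. In 1D the normalizer can be brute-forced on a grid, which is exactly what stops being possible in high dimensions.

```python
import numpy as np

def log_unnormalized(x):
    # Hypothetical double-well log density, known only up to a constant.
    return -x**4 / 4 + x**2 / 2

def score(x):
    # Analytic derivative of log_unnormalized with respect to x.
    return -x**3 + x

# Brute-force the log normalizer on a grid (only feasible in low dimensions).
grid = np.linspace(-6.0, 6.0, 20001)
dx = grid[1] - grid[0]
log_Z = np.log(np.sum(np.exp(log_unnormalized(grid))) * dx)

def log_normalized(x):
    # Proper log density: subtract the constant log_Z.
    return log_unnormalized(x) - log_Z

def finite_diff(f, x, eps=1e-5):
    # Central finite-difference approximation of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

xs = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
grad_unnormalized = finite_diff(log_unnormalized, xs)
grad_normalized = finite_diff(log_normalized, xs)

# Largest gap between the normalized gradient and the analytic score.
max_gap = float(np.max(np.abs(grad_normalized - score(xs))))
print(max_gap)
```

Both finite-difference gradients agree with the analytic score up to numerical error, even though `log_Z` was never needed to compute `score`.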
P.S. That does make me wonder if it's turtles all the way down... In a sense, distributions whose grad(log-likelihood) can be tractably modelled could also be argued to be less flexible than distributions that don't fall within this class, so in the future there may be some second-order diffusion method that operates on grad(grad(log-likelihood)) instead. The downside is the huge compute required for the second derivative, but the upside could be much more flexible modelling capability.