Submitted by benanne t3_107g3yf in MachineLearning
benanne OP t1_j3qzbcr wrote
Reply to comment by themrzmaster in [R] Diffusion language models by benanne
If you were to graph the weighting that ensures the training loss corresponds to likelihood, you would find that it looks roughly like exp(-x). In other words, the importance of the noise levels decreases more or less exponentially (but not exactly!) as they increase. So if you want to train a diffusion model to maximise likelihood (which can be a valid thing to do, for example if you want to use it for lossless compression), your training set should have many more examples of low noise levels than of high noise levels (orders of magnitude more, in fact).
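To make the "orders of magnitude" point concrete, here is a minimal numpy sketch. It treats the weighting as exactly w(sigma) = exp(-sigma) and caps noise levels at an arbitrary sigma_max = 10; both are illustrative assumptions, since the true likelihood weighting is only roughly exponential:

```python
import numpy as np

# Minimal sketch: pretend the likelihood weighting is exactly
# w(sigma) = exp(-sigma) (in reality it is only *roughly* exponential),
# with noise levels in [0, sigma_max] for an arbitrary sigma_max = 10.
def w(sigma):
    return np.exp(-sigma)

rng = np.random.default_rng(0)
sigma_max = 10.0

# Two equivalent ways to impose the weighting:
# (a) sample sigma uniformly and multiply each example's loss by w(sigma);
sigma_uniform = rng.uniform(0.0, sigma_max, size=100_000)
loss_weights = w(sigma_uniform)

# (b) sample sigma from a density proportional to w(sigma) -- here an
#     Exponential(1) truncated to [0, sigma_max] -- and leave the loss
#     unweighted.
sigma_is = rng.exponential(1.0, size=100_000)
sigma_is = sigma_is[sigma_is < sigma_max]

# Low noise levels dominate by orders of magnitude:
print(w(0.1) / w(8.0))          # ~e^7.9, about 2700x more weight
print((sigma_is < 1.0).mean())  # ~63% of samples fall below sigma = 1
```

Both views give the same expected gradient: you can keep sampling noise levels uniformly and reweight the loss, or fold the weighting into the noise-level distribution itself, which is the "many more examples of low noise levels" phrasing above.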
Usually when we train diffusion models, we sample noise levels uniformly, or from a simple distribution, but certainly not from a distribution which puts exponentially more weight on low noise levels. Therefore, relative to the likelihood loss, the loss we tend to use puts a lot less emphasis on low noise levels, which correspond to high spatial frequencies. Section 5 of my earlier blog post is an attempt at an intuitive explanation of why this correspondence between noise levels and spatial frequencies exists: https://benanne.github.io/2022/01/31/diffusion.html#scale
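For a rough feel of that correspondence, here is a numpy sketch using a 1D signal with a 1/f amplitude spectrum as a stand-in for a natural image (the 1/f shape, the signal length, and the noise levels are illustrative assumptions, not something from the post or the blog):

```python
import numpy as np

# Illustration of the noise-level <-> frequency correspondence: natural
# images have power spectra that fall off with frequency (approximated
# here by a 1/f amplitude spectrum), while white Gaussian noise has a
# flat spectrum, so growing sigma drowns high frequencies first.
rng = np.random.default_rng(0)
n = 4096
freqs = np.fft.rfftfreq(n)[1:]                  # skip the DC component

phases = rng.uniform(0, 2 * np.pi, size=freqs.shape)
spectrum = (1.0 / freqs) * np.exp(1j * phases)  # 1/f amplitudes
signal = np.fft.irfft(np.concatenate([[0.0], spectrum]), n=n)
signal /= signal.std()                          # normalise to unit variance

sig_power = np.abs(np.fft.rfft(signal)[1:]) ** 2 / n  # per-bin signal power

for sigma in [0.01, 0.1, 1.0]:
    # Expected per-bin power of white noise with std sigma is sigma^2.
    drowned = freqs[sig_power < sigma ** 2]
    if drowned.size:
        print(f"sigma={sigma:5.2f}: noise dominates above frequency ~{drowned[0]:.4f}")
    else:
        print(f"sigma={sigma:5.2f}: signal dominates at every frequency")
```

Since the signal's spectrum falls off with frequency while the noise's is flat, the crossover frequency above which noise dominates moves lower as sigma grows: low noise levels only obscure the high frequencies, while high noise levels obscure nearly everything.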
"Variational diffusion models" is another paper that focuses on optimising likelihood, which you might find more accessible: https://arxiv.org/abs/2107.00630
themrzmaster t1_j3rma2a wrote
Thank you very much!