dasayan05
dasayan05 OP t1_ixx046o wrote
Reply to comment by SkeeringReal in [D] ICLR 2023 reviews are out. How was your experience ? by dasayan05
Spotlight/Oral decisions are mostly made case by case and are totally up to the ACs. I don't think you can derive a general rule for them.
dasayan05 t1_iwzo45d wrote
Reply to [D] David Ha/@hardmaru of Stability AI is liking all of Elon Musk's tweets by datasciencepro
If there is anything toxic, it's (people like) you and this post.
You basically fall into the large pool of people who decide the state of the world just by reading news headlines.
dasayan05 t1_iwzmx7g wrote
Reply to comment by datasciencepro in [D] David Ha/@hardmaru of Stability AI is liking all of Elon Musk's tweets by datasciencepro
Francois Chollet is not the "Tech community", and neither is David Ha.
It's their personal opinion, everyone has their own.
dasayan05 t1_iwi4lir wrote
I have trained DDPM (not the SDE variant) on CIFAR-10 using 4 RTX 3090s with an effective batch size of 1024. It took ~150k iterations (not epochs), about 1.5 days, to reach FID 2.8 (not really SOTA, but it works).
dasayan05 t1_ivpmx7r wrote
Reply to [Discussion] Could someone explain the math behind the number of distinct images that can be generated with a latent diffusion model? by [deleted]
There is no way to feasibly compute what you are asking for.
Diffusion models (in fact, any modern generative model) are defined on a continuous image space, i.e. a continuous vector of dimension 512x512. This space is not discrete, so there isn't even a notion of "distinct images". A tiny continuous change can lead to another plausible image, and there are (theoretically) infinitely many tiny changes you can apply to an image to produce another one that looks the same but isn't the same point in image space.
The (theoretically) correct answer to your question is that there are infinitely many images you can sample from a given generative model.
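A toy sketch just to make "distinct but visually identical" concrete (PyTorch, purely illustrative):

```python
import torch

# A tiny perturbation yields a technically distinct point in image space
# that is visually indistinguishable from the original.
x = torch.rand(3, 512, 512)                  # stand-in for a 512x512 image
x_perturbed = x + 1e-6 * torch.randn_like(x)
print(torch.equal(x, x_perturbed))           # False: distinct points
print((x - x_perturbed).abs().max().item())  # ~1e-6: visually identical
```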
dasayan05 OP t1_ivcwv24 wrote
Reply to comment by johnnycrab_whoop in [D] ICLR 2023 reviews are out. How was your experience ? by dasayan05
Go to OpenReview's Tasks tab and click on the pending task -- you will see the time countdown there.
dasayan05 t1_iumd15y wrote
EBMs are a broad category of models that use energy functions. Which one(s) do you need?
dasayan05 t1_it95xq7 wrote
Reply to comment by UncleVesem1r in [D] DDPM vs Score Matching by WallabyDue2778
IMO, the forward diffusion process isn't really a "process" -- it need not be sequential; it's parallelizable. The sole purpose of the forward process is to simulate noisy data from a set of "noisy data distributions" crafted with a known set of noise scales -- that's it. SBMs and DDPMs both have this process. For SBMs, choosing the correct largest scale is again a heuristic hyperparameter: it has to overpower the data variance so that the chain reaches an uninformative prior. For DDPMs, the prior is always reached because of the way the noise scales and attenuation coefficients are computed from \beta_t.
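A minimal sketch of that closed-form simulation for DDPM, assuming the usual linear \beta schedule (names below are illustrative placeholders, not any particular codebase):

```python
import torch

# Closed-form forward "process" of DDPM: x_t can be drawn directly from x_0.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)             # \bar{\alpha}_t

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) directly -- no sequential chain needed."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Every timestep can be simulated independently (and in parallel across a batch):
x0 = torch.randn(8, 3, 32, 32)                         # stand-in for real images
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
```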
Agree with your second part. The SDE formulation is good -- it basically brings SBMs into a stronger theoretical framework. SDEs offer a reverse process that is analytic and in which the score naturally appears -- i.e., again, not much left to hyperparameters.
dasayan05 t1_it6u4ho wrote
Reply to comment by WallabyDue2778 in [D] DDPM vs Score Matching by WallabyDue2778
To clarify, "score matching" itself is quite theoretically grounded -- what is not, is the fact that score matching and langevin dymanics is not theoretically "coupled". Langevin dynamics is chosen more like an intuitive way of "using" the score-estimates. Moreover, langevin dynamics theretically takes infinite time to reach the true distribution and it's convergence depends on proper choice of `\delta`, a tiny number that acts like step size.
x_{t-1} = x_t + s(x_t, t) \delta / 2 + sqrt{\delta} z
Now, look at DDPM. DDPM's training objective is totally "coupled" with it's sampling process -- it all comes from very standard calculations on the underlying PGM (probabilistic graphical model). Notice that DDPMs reverse process do not involve a hyperparam like `\delta`, everything is tied to the known \beta schedule -- which tells you what exact step size to take in order to converge in finitely many (T) steps. DDPM's reverse process is not langevin dynamics -- it just looks like it, but has stronger gurantee on convergence.
This makes it more robust compared to Score based langevin dynamics.
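For reference, a rough sketch of annealed Langevin sampling with a score model `s(x, i)` conditioned on the noise-scale index -- the names and the step-size heuristic are placeholders, shown only to make explicit where `\delta` enters:

```python
import torch

# Annealed Langevin sampling sketch: `delta` is the extra step-size
# hyperparameter that DDPM's reverse process does not need.
def annealed_langevin(s, x, sigmas, n_steps=100, eps=2e-5):
    # `sigmas` ordered from largest to smallest noise scale
    for i, sigma in enumerate(sigmas):
        delta = eps * (sigma / sigmas[-1]) ** 2        # common step-size heuristic
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * delta * s(x, i) + delta ** 0.5 * z
    return x
```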
dasayan05 t1_it46pby wrote
Reply to [D] DDPM vs Score Matching by WallabyDue2778
>... these seem to be two dominant approaches ...
Totally. There are two streams of ideas, similar but not exactly equivalent, namely Score-Based Models (SBM) and Denoising Diffusion Probabilistic Models (DDPM). There is an effort to unify the two under the umbrella of Stochastic Differential Equations (SDE), where SBM -> "Variance Exploding SDE" and DDPM -> "Variance Preserving SDE". DDPM is by far the more popular -- the reason is that DDPM has stronger theoretical guarantees and fewer hyperparameters. SBMs are, in some parts, intuitive and observation-based.
>.. they learn the noise rather than the score ..
Yes. SBM uses "score" while DDPM uses "noise-estimates"; but they are related -- "score = - eps / noise-std" see CVPR22's Diffusion slides (slide 57). IMO, the major difference between SBM and DDPM is their forward noising process -- SBM only adds noise -- DDPM adds noise as well as attenuates the signal and this process is systematically "tied" to the noise schedule \beta_t. This makes the reverse process look slightly different.
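For concreteness, that relation is just the standard Gaussian identity (written here in DDPM's notation rather than the slides'):
x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon
score = \nabla_{x_t} \log q(x_t | x_0) = -(x_t - \sqrt{\bar\alpha_t} x_0) / (1 - \bar\alpha_t) = - \epsilon / \sqrt{1 - \bar\alpha_t}
so the "noise-std" in the relation above is \sqrt{1 - \bar\alpha_t}.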
If you want to implement diffusion models, start with DDPM as formulated by Ho et al. I have never seen an algorithm written as clearly as the ones in Ho et al.'s Algorithms 1 & 2. It can't get any simpler in terms of implementation.
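For reference, here is a compressed sketch of what those two algorithms boil down to, assuming an `eps_model(x, t)` noise-prediction network and the \sigma_t^2 = \beta_t choice -- the names are placeholders, not the paper's exact notation:

```python
import torch

# Sketch of Ho et al.'s Algorithm 1 (training) and Algorithm 2 (sampling).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def train_loss(eps_model, x0):                         # Algorithm 1 (one step)
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((eps - eps_model(xt, t)) ** 2).mean()      # simple MSE on the noise

@torch.no_grad()
def sample(eps_model, shape):                          # Algorithm 2
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t))
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```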
dasayan05 t1_irrfd7w wrote
It doesn't matter which one you implement. Implementing anything from scratch always exposes you to deeper insights that are hard to get by looking at dry mathematics on paper. Just one piece of advice: pick a paper/algorithm that is well known to work and is reproducible. Then you are good.
dasayan05 t1_ir9d9s3 wrote
Reply to comment by golljj in [R] DDIM Reconstruction Confusion by [deleted]
The first part of your statement is correct -- that is called the "forward process", and it is only needed at training time.
Yes, the encoder in DDIM basically adds predicted noise to travel back to x_T -- it's more like the "reverse of the reverse process", but we can't really call it the "forward process", can we? For example, the true "forward process" is almost entirely random, and you can skip to any x_t by reparameterization. This isn't true for DDIM's "reverse of the reverse process" -- it must be sequential and deterministic.
dasayan05 t1_ir5q57s wrote
Reply to comment by carlml in [D] What happens when the reconstruction is fed back to the VAE? by carlml
Yes, I understand what you mean.
I am asking whether feeding back its output has any special interpretation in terms of the VAE. Is there any rationale behind doing this? Are you expecting something specific from it?
dasayan05 t1_ir4kr52 wrote
I don't know the answer, but does this "feeding back its reconstruction" have any meaning/interpretation?
dasayan05 t1_ir4kcfg wrote
Reply to comment by golljj in [R] DDIM Reconstruction Confusion by [deleted]
Sorry, but that's not really the correct interpretation. The "forward process" is not the encoder -- it's a stochastic process. The encoder is the "reverse of eq. 14", i.e. integrating the ODE in eq. 14 backwards in time -- and that is not the same as the "forward process".
dasayan05 t1_iqy7x8z wrote
Reply to [R] DDIM Reconstruction Confusion by [deleted]
Yes, you get the noise from the U-Net itself.
The idea behind DDIM is to make the reverse process deterministic, i.e. converting the SDE into an ODE (eq. 14). That said, an ODE can be integrated backwards in time: starting from the final solution (the clean image x_0) and integrating with negative `dt`, you reach the noise (i.e. the "encoded feature") x_T. Thus, you get a negative sign in front of the noise estimator `\epsilon_{\theta}` and then treat it like a normal ODE, integrating from end time (t=0) to start time (t=T).
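Roughly, one deterministic DDIM (\eta = 0) step looks like the sketch below; running the same update with increasing `t` is exactly that backwards integration from x_0 towards x_T (`eps_model` and `alpha_bars` are placeholder names, not the paper's notation):

```python
import torch

# One deterministic DDIM step x_t -> x_{t_prev}. Using t_prev < t gives
# generation; using t_prev > t runs the same ODE backwards ("encoding").
@torch.no_grad()
def ddim_step(eps_model, x, t, t_prev, alpha_bars):
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x, t)
    x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean image
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
```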
dasayan05 t1_iqrlbf7 wrote
Where did you get this architecture from? Any reference? It's not clear what it is supposed to mean.
dasayan05 t1_iqp8hnf wrote
Reply to comment by 029187 in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
Possible. But what is the advantage of that? Even if we did find a way to explicitly noise the data/gradients, we would still be better off with mini-batches, as they consume less memory.
dasayan05 t1_iqnltrg wrote
Reply to [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
Mini-batches are not here just because of memory limitations. They inject noise into the optimization, which helps escape local minima and explore the loss landscape.
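A toy illustration (not from the thread) of where that noise comes from: the subsampling noise you get for free with mini-batches is what you would otherwise have to inject by hand into a full-batch gradient.

```python
import torch

# Throwaway linear-regression example comparing a hand-noised full-batch
# gradient with a mini-batch gradient whose noise comes from subsampling.
X, y = torch.randn(1000, 10), torch.randn(1000)
w = torch.zeros(10, requires_grad=True)

def grad_of(w, Xb, yb):
    loss = ((Xb @ w - yb) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

g_full_noisy = grad_of(w, X, y) + 0.01 * torch.randn(10)    # hand-injected noise
idx = torch.randperm(1000)[:64]
g_minibatch = grad_of(w, X[idx], y[idx])                    # noise from subsampling
```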
dasayan05 t1_iz2ucqp wrote
Reply to comment by Competitive_Dog_6639 in [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton] by shitboots
Yes, they are like "local" updates, I believe.