dasayan05
dasayan05 OP t1_ixx046o wrote
Reply to comment by SkeeringReal in [D] ICLR 2023 reviews are out. How was your experience ? by dasayan05
Spotlight/Oral decisions are mostly made case by case and are totally up to the ACs. I don't think you can derive a general rule for them.
dasayan05 t1_iwzo45d wrote
Reply to [D] David Ha/@hardmaru of Stability AI is liking all of Elon Musk's tweets by datasciencepro
If there is anything toxic, it's (people like) you and this post.
You basically fall into the large pool of people who decide the state of the world just by reading news headlines.
dasayan05 t1_iwzmx7g wrote
Reply to comment by datasciencepro in [D] David Ha/@hardmaru of Stability AI is liking all of Elon Musk's tweets by datasciencepro
Francois Chollet is not the "Tech community", and neither is David Ha.
It's their personal opinion, everyone has their own.
dasayan05 t1_iwi4lir wrote
I have trained DDPM (not the SDE variant) on CIFAR-10 using 4 RTX 3090s with an effective batch size of 1024. It took ~150k iterations (not epochs), about 1.5 days, to reach FID 2.8 (not really SOTA, but it works).
dasayan05 t1_ivpmx7r wrote
Reply to [Discussion] Could someone explain the math behind the number of distinct images that can be generated with a latent diffusion model? by [deleted]
There is no way to feasibly compute what you are asking for.
Diffusion models (in fact, any modern generative model) are defined on a continuous image space, i.e. a continuous vector of dimension 512x512. This space is not discrete, so there isn't even a notion of "distinct images". A tiny continuous change can lead to another plausible image, and there are (theoretically) infinitely many tiny changes you can apply to an image to produce another one that looks the same but isn't the same point in image space.
The (theoretically) correct answer to your question is that there are infinitely many images you can sample from a given generative model.
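A toy sketch just to make "distinct but visually identical" concrete (PyTorch, purely illustrative):

```python
import torch

# A tiny perturbation yields a technically distinct point in image space
# that is visually indistinguishable from the original.
x = torch.rand(3, 512, 512)                  # stand-in for a 512x512 image
x_perturbed = x + 1e-6 * torch.randn_like(x)
print(torch.equal(x, x_perturbed))           # False: distinct points
print((x - x_perturbed).abs().max().item())  # ~1e-6: visually identical
```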
dasayan05 OP t1_ivcwv24 wrote
Reply to comment by johnnycrab_whoop in [D] ICLR 2023 reviews are out. How was your experience ? by dasayan05
Go to OpenReview's Tasks tab and click on the pending task -- you will see the time countdown there.
dasayan05 t1_iumd15y wrote
EBMs are a broad category of models that use energy functions. Which one(s) do you need?
dasayan05 t1_it95xq7 wrote
Reply to comment by UncleVesem1r in [D] DDPM vs Score Matching by WallabyDue2778
IMO, the forward diffusion process isn't really a "process" -- it need not be sequential; it's parallelizable. The sole purpose of the forward process is to simulate noisy data from a set of "noisy data distributions" crafted with a known set of noise scales -- that's it. SBMs and DDPMs both have this process. For SBMs, choosing the correct largest scale is again a heuristic hyperparameter: it has to overpower the data variance so that the chain reaches an uninformative prior. For DDPMs, the prior is always reached because of the way the noise scales and attenuation coefficients are computed from \beta_t.
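A minimal sketch of that closed-form simulation for DDPM, assuming the usual linear \beta schedule (names below are illustrative placeholders, not any particular codebase):

```python
import torch

# Closed-form forward "process" of DDPM: x_t can be drawn directly from x_0.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)             # \bar{\alpha}_t

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) directly -- no sequential chain needed."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Every timestep can be simulated independently (and in parallel across a batch):
x0 = torch.randn(8, 3, 32, 32)                         # stand-in for real images
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
```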
Agree with your second part. The SDE formulation is good -- it basically brings SBMs into a stronger theoretical framework. SDEs offer a reverse process that is analytic and in which the score naturally appears -- i.e., again, not much left to hyperparameters.
dasayan05 t1_it6u4ho wrote
Reply to comment by WallabyDue2778 in [D] DDPM vs Score Matching by WallabyDue2778
To clarify, "score matching" itself is quite theoretically grounded -- what is not, is the fact that score matching and langevin dymanics is not theoretically "coupled". Langevin dynamics is chosen more like an intuitive way of "using" the score-estimates. Moreover, langevin dynamics theretically takes infinite time to reach the true distribution and it's convergence depends on proper choice of `\delta`, a tiny number that acts like step size.
x_{t-1} = x_t + s(x_t, t) \delta / 2 + sqrt{\delta} z
Now, look at DDPM. DDPM's training objective is totally "coupled" with it's sampling process -- it all comes from very standard calculations on the underlying PGM (probabilistic graphical model). Notice that DDPMs reverse process do not involve a hyperparam like `\delta`, everything is tied to the known \beta schedule -- which tells you what exact step size to take in order to converge in finitely many (T) steps. DDPM's reverse process is not langevin dynamics -- it just looks like it, but has stronger gurantee on convergence.
This makes it more robust compared to Score based langevin dynamics.
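For reference, a rough sketch of annealed Langevin sampling with a score model `s(x, i)` conditioned on the noise-scale index -- the names and the step-size heuristic are placeholders, shown only to make explicit where `\delta` enters:

```python
import torch

# Annealed Langevin sampling sketch: `delta` is the extra step-size
# hyperparameter that DDPM's reverse process does not need.
def annealed_langevin(s, x, sigmas, n_steps=100, eps=2e-5):
    # `sigmas` ordered from largest to smallest noise scale
    for i, sigma in enumerate(sigmas):
        delta = eps * (sigma / sigmas[-1]) ** 2        # common step-size heuristic
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * delta * s(x, i) + delta ** 0.5 * z
    return x
```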
dasayan05 t1_it46pby wrote
Reply to [D] DDPM vs Score Matching by WallabyDue2778
>... these seem to be two dominant approaches ...
Totally. There are two streams of ideas, similar but not exactly equivalent, namely Score-Based Models (SBM) and Denoising Diffusion Probabilistic Models (DDPM). There is an effort to unify the two under the umbrella of Stochastic Differential Equations (SDE), where SBM -> "Variance Exploding SDE" and DDPM -> "Variance Preserving SDE". DDPM is by far the more popular -- the reason is that DDPM has stronger theoretical guarantees and fewer hyperparameters. SBMs are, in some parts, intuitive and observation-based.
>.. they learn the noise rather than the score ..
Yes. SBM uses "score" while DDPM uses "noise-estimates"; but they are related -- "score = - eps / noise-std" see CVPR22's Diffusion slides (slide 57). IMO, the major difference between SBM and DDPM is their forward noising process -- SBM only adds noise -- DDPM adds noise as well as attenuates the signal and this process is systematically "tied" to the noise schedule \beta_t. This makes the reverse process look slightly different.
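For concreteness, that relation is just the standard Gaussian identity (written here in DDPM's notation rather than the slides'):
x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon
score = \nabla_{x_t} \log q(x_t | x_0) = -(x_t - \sqrt{\bar\alpha_t} x_0) / (1 - \bar\alpha_t) = - \epsilon / \sqrt{1 - \bar\alpha_t}
so the "noise-std" in the relation above is \sqrt{1 - \bar\alpha_t}.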
If you want to implement diffusion models, start with DDPM as formulated by Ho et al. I have never seen an algorithm written as clearly as the ones in Ho et al.'s Algorithms 1 & 2. It can't get any simpler in terms of implementation.
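For reference, here is a compressed sketch of what those two algorithms boil down to, assuming an `eps_model(x, t)` noise-prediction network and the \sigma_t^2 = \beta_t choice -- the names are placeholders, not the paper's exact notation:

```python
import torch

# Sketch of Ho et al.'s Algorithm 1 (training) and Algorithm 2 (sampling).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def train_loss(eps_model, x0):                         # Algorithm 1 (one step)
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((eps - eps_model(xt, t)) ** 2).mean()      # simple MSE on the noise

@torch.no_grad()
def sample(eps_model, shape):                          # Algorithm 2
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t))
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```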
dasayan05 t1_irrfd7w wrote
It doesn't matter which one you implement. Implementing anything from scratch always exposes you to deeper insights that are hard to get by looking at dry mathematics on paper. Just one piece of advice: pick a paper/algorithm that is well known to work and is reproducible. Then you are good.
dasayan05 t1_ir9d9s3 wrote
Reply to comment by golljj in [R] DDIM Reconstruction Confusion by [deleted]
The first part of your statement is correct -- that is called the "forward process", and it is only needed at training time.
Yes, the encoder in DDIM basically adds predicted noise to travel back to x_T -- it's more like the "reverse of the reverse process", but we can't really call it the "forward process", can we? For example, the true "forward process" is almost entirely random, and you can skip to any x_t by reparameterization. This isn't true for DDIM's "reverse of the reverse process" -- it must be sequential and deterministic.
dasayan05 t1_ir5q57s wrote
Reply to comment by carlml in [D] What happens when the reconstruction is fed back to the VAE? by carlml
Yes, I understand what you mean.
I am asking whether feeding back its output has any special interpretation in terms of the VAE. Is there any rationale behind doing this? Are you expecting something specific from it?
dasayan05 t1_ir4kr52 wrote
I don't know the answer, but does this "feeding back its reconstruction" have any meaning/interpretation?
dasayan05 t1_ir4kcfg wrote
Reply to comment by golljj in [R] DDIM Reconstruction Confusion by [deleted]
Sorry, but that's not really the correct interpretation. The "forward process" is not the encoder -- it's a stochastic process. The encoder is the "reverse of eq. 14", i.e. integrating the ODE in eq. 14 backwards in time -- and that is not the same as the "forward process".
dasayan05 t1_iqy7x8z wrote
Reply to [R] DDIM Reconstruction Confusion by [deleted]
Yes, you get the noise from the U-Net itself.
The idea behind DDIM is to make the reverse process deterministic, i.e. converting the SDE into an ODE (eq. 14). That said, an ODE can be integrated backwards in time: starting from the final solution (the clean image x_0) and integrating with negative `dt`, you reach the noise (i.e. the "encoded feature") x_T. Thus, you get a negative sign in front of the noise estimator `\epsilon_{\theta}` and then treat it like a normal ODE, integrating from end time (t=0) to start time (t=T).
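Roughly, one deterministic DDIM (\eta = 0) step looks like the sketch below; running the same update with increasing `t` is exactly that backwards integration from x_0 towards x_T (`eps_model` and `alpha_bars` are placeholder names, not the paper's notation):

```python
import torch

# One deterministic DDIM step x_t -> x_{t_prev}. Using t_prev < t gives
# generation; using t_prev > t runs the same ODE backwards ("encoding").
@torch.no_grad()
def ddim_step(eps_model, x, t, t_prev, alpha_bars):
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x, t)
    x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean image
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
```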
dasayan05 t1_iqrlbf7 wrote
Where did you get this architecture from? Any reference? It's not clear what it is supposed to mean.
dasayan05 t1_iqp8hnf wrote
Reply to comment by 029187 in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
Possible. But what is the advantage of that? Even if we did find a way to explicitly noise the data/gradients, we would still be better off with mini-batches, as they consume less memory.
dasayan05 t1_iqnltrg wrote
Reply to [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
Mini-batches are not here just because of memory limitations. They inject noise into the optimization, which helps escape local minima and explore the loss landscape.
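A toy illustration (not from the thread) of where that noise comes from: the subsampling noise you get for free with mini-batches is what you would otherwise have to inject by hand into a full-batch gradient.

```python
import torch

# Throwaway linear-regression example comparing a hand-noised full-batch
# gradient with a mini-batch gradient whose noise comes from subsampling.
X, y = torch.randn(1000, 10), torch.randn(1000)
w = torch.zeros(10, requires_grad=True)

def grad_of(w, Xb, yb):
    loss = ((Xb @ w - yb) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

g_full_noisy = grad_of(w, X, y) + 0.01 * torch.randn(10)    # hand-injected noise
idx = torch.randperm(1000)[:64]
g_minibatch = grad_of(w, X[idx], y[idx])                    # noise from subsampling
```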
dasayan05 t1_iz2ucqp wrote
Reply to comment by Competitive_Dog_6639 in [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton] by shitboots
Yes, they are like "local" updates, I believe.