The idea behind DDIM is to make the reverse process deterministic, i.e. converting the SDE into an ODE (eq. 14). That said, an ODE can be integrated backwards in time: start from the final solution (the clean image x_0) and integrate with negative `dt` until you reach the noise (i.e. the "encoded feature") x_T. Thus, you get a negative sign in front of the noise estimator `\epsilon_{\theta}`, and then you treat it like a normal ODE and integrate from the end-time (t=0) to the start-time (t=T).
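To make the "integrate the ODE in either direction" idea concrete, here is a minimal sketch on a toy 1-D problem. Everything named here is an assumption for illustration: the linear beta schedule, the constants `MU` and `SIGMA`, and `eps_model`, which stands in for the trained U-Net by using the closed-form noise estimate for Gaussian data. It uses the closed-form DDIM update rather than an explicit Euler step, so running "backwards" amounts to swapping the two time indices instead of writing a negative `dt`.

```python
import math

T = 1000
# Assumed linear beta schedule; alpha_bar[0] = 1 corresponds to the clean image (t=0).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = [1.0]
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

MU, SIGMA = 2.0, 0.5  # hypothetical toy "dataset": x_0 ~ N(MU, SIGMA^2)


def eps_model(x, t):
    """Stand-in for the U-Net: exact E[eps | x_t] for the toy Gaussian data."""
    a = alpha_bar[t]
    var_t = a * SIGMA ** 2 + (1.0 - a)
    return math.sqrt(1.0 - a) * (x - math.sqrt(a) * MU) / var_t


def ddim_step(x, t_from, t_to):
    """One deterministic DDIM update from time t_from to time t_to (either direction)."""
    a_from, a_to = alpha_bar[t_from], alpha_bar[t_to]
    eps = eps_model(x, t_from)
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def encode(x0):
    """Walk t = 0 -> T sequentially: clean sample to latent x_T."""
    x = x0
    for t in range(T):
        x = ddim_step(x, t, t + 1)
    return x


def decode(xT):
    """Walk t = T -> 0 sequentially: latent x_T back to a clean sample."""
    x = xT
    for t in range(T, 0, -1):
        x = ddim_step(x, t, t - 1)
    return x


x0 = 3.0
x_T = encode(x0)
x0_rec = decode(x_T)
print(x_T, x0_rec)
```

Encoding the same `x0` twice gives the same `x_T`, and decoding should land close to the original `x0`, up to discretization error from the finite number of steps.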
What I think is that you should treat DDIM as an auto-encoder: the encoder maps the clean image $x_0$ to $x_t$ via the forward process, and the reverse process is a decoder that maps $x_t$ to $\hat{x}_0$. Note that $x_t$ is not pure Gaussian noise, but the image $x_0$ with some Gaussian noise added.
Sorry, but that's not really the correct interpretation. The "forward process" is not the encoder -- it's a stochastic process. The encoder is the "reverse of eq. 14", i.e. integrating the ODE in eq. 14 backwards in time -- that is not the same as the "forward process".
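For reference, this is how I read the update and its ODE limit. The notation is an assumption, since the thread does not define it: $\alpha_t$ is the cumulative product of $(1-\beta)$ terms, and $\bar{x}$ and $\sigma$ are rescaled variables.

$$
\frac{x_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}
= \frac{x_t}{\sqrt{\alpha_t}}
+ \left( \sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}} \right) \epsilon_{\theta}(x_t, t)
$$

With $\bar{x} = x / \sqrt{\alpha}$ and $\sigma = \sqrt{(1-\alpha)/\alpha}$, the limit $\Delta t \to 0$ gives

$$
d\bar{x}(t) = \epsilon_{\theta}\big(x(t), t\big) \, d\sigma(t).
$$

The decoder integrates this from $\sigma(T)$ down to $\sigma(0) = 0$, so $d\sigma < 0$. The encoder integrates the same equation from $\sigma(0)$ up to $\sigma(T)$, so $d\sigma > 0$. Neither direction draws fresh noise, which is what separates the encoder from the stochastic forward process.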
Do you mean that the forward process only adds pure Gaussian noise with different $\beta$ to $x_0$ to obtain $x_T$, whereas the encoder here adds the noise predicted by the estimator $\epsilon_{\theta}(x_t)$ to $x_t$?
The first part of your statement is correct -- that is called the "forward process" and it is only needed at training time.
Yes, the encoder in DDIM basically adds predicted noise to travel back to x_T -- it's more like the "reverse of the reverse process", but we can't really call it the "forward process", can we? For example, the true "forward process" is almost entirely random, and you can skip to any x_t by re-parameterization. This isn't true for DDIM's "reverse of the reverse process" -- it must be sequential and deterministic.
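The "skip to any x_t by re-parameterization" property of the forward process can be sketched in a few lines. The linear beta schedule and the helper name `q_sample` are assumptions for illustration, and a scalar stands in for an image:

```python
import math
import random

T = 1000
# Assumed linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = [1.0]
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))


def q_sample(x0, t, rng):
    """Forward process: jump straight from x_0 to x_t using fresh Gaussian noise."""
    eps = rng.gauss(0.0, 1.0)
    a = alpha_bar[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps


rng = random.Random(0)
x0 = 3.0
# Three draws of x_500 from the same x_0: no intermediate steps, and each draw differs.
samples = [q_sample(x0, 500, rng) for _ in range(3)]
print(samples)
```

No network call appears anywhere, and each call returns a different `x_t` for the same `x0`. The deterministic encoder has no such shortcut, because every step needs the estimator's output at the current `x_t`.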
dasayan05 t1_iqy7x8z wrote
Yes, you get the noise from the U-Net itself.