Submitted by Agreeable-Run-9152 t3_101s5kj in MachineLearning

Usually, when you approximate the score s(x, t) in diffusion models, the time t is passed through an embedding network before it is added to the x components in the ResNet blocks of your model.
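For concreteness, here is a minimal PyTorch sketch of the pattern I mean (the module and function names are my own, not from any particular codebase):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of timesteps, t: (B,) -> (B, dim), as in DDPM."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ResBlock(nn.Module):
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)  # per-block projection of the embedding

    def forward(self, x, t_emb):
        h = torch.relu(self.conv1(x))
        # broadcast-add the projected time embedding across all spatial positions
        h = h + self.emb_proj(t_emb)[:, :, None, None]
        return x + torch.relu(self.conv2(h))
```

(In practice the output of timestep_embedding is usually also passed through a small MLP once, and each block then gets its own linear projection of it.)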

What is the rationale behind this? Couldn't you just concatenate x and t in the channel dimension? And if you were to use any model other than a UNet, what would be the equivalent?


Comments


bloc97 t1_j2pj1c6 wrote

There are many ways to condition a diffusion model on time, but concatenating it to the input is the least efficient method, because:

  1. The first layer of your model is a convolutional layer, and applying a convolution to a "time" image that has the same value everywhere is not computationally efficient. Early conv layers exist to detect variations in an image (e.g. texture); applying the same kernel over and over to a constant image is wasted work (see the sketch after this list).
  2. By giving t only to the first layer, the network has to waste resources/neurons to propagate that information through the rest of the network. This waste is compounded by the fact that the time information must be carried along for every "pixel" in each convolutional feature map (because it is a ConvNet). Why not skip all that and give the time embedding directly to the deeper layers of the network?
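To make the contrast concrete, a minimal sketch of the concatenation variant the question asks about (shapes and names here are illustrative): t becomes a constant extra channel, so the first conv's kernels see no spatial variation in it at all.

```python
import torch

B, C, H, W = 8, 3, 32, 32
x = torch.randn(B, C, H, W)
t = torch.randint(0, 1000, (B,))

# t as an extra input channel: the same value at every pixel
t_channel = t.view(B, 1, 1, 1).float().expand(B, 1, H, W)
x_in = torch.cat([x, t_channel], dim=1)  # (B, C + 1, H, W)

# the first conv must now carry this constant forward through every later layer
first_conv = torch.nn.Conv2d(C + 1, 64, 3, padding=1)
h = first_conv(x_in)
```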

bloc97 t1_j2q0aio wrote

I'm not too familiar with FNOs, but I'd guess you could start experimenting by adding the time embedding to the "DC component" of the Fourier transform; that would be at least equivalent to adding the time embedding to the entire feature map in a ResNet.
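A minimal sketch of that idea, assuming 2D feature maps and PyTorch's FFT API (the function name is made up):

```python
import torch

def add_time_to_dc(x, t_emb):
    """x: (B, C, H, W) features; t_emb: (B, C) time embedding.
    Shifts the DC (zero-frequency) Fourier mode by the embedding."""
    X = torch.fft.rfft2(x)               # (B, C, H, W//2 + 1), complex
    X[..., 0, 0] = X[..., 0, 0] + t_emb  # the DC mode lives at index [0, 0]
    # note: a shift of c in the DC mode becomes c / (H * W) per pixel after irfft2,
    # so this is a uniform shift of the whole feature map, up to that scaling
    return torch.fft.irfft2(X, s=x.shape[-2:])
```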
