Submitted by grid_world t3_11qazip in deeplearning
I have a use case where (say) N RGB input images are used to reconstruct a single RGB output image, using either an autoencoder or a U-Net architecture. More concretely, if N = 18, the 18 RGB images are fed to a CNN, which should then predict one target RGB output image.
If the spatial width and height are 90, then one input sample might have shape (18, 3, 90, 90), where 18 is the number of images in the sample, not the batch size! AFAIK, (18, 3, 90, 90) as input to a CNN will produce (18, 3, 90, 90) as output, whereas I want (3, 90, 90) as the desired output.
Any idea how to achieve this?
suflaj t1_jc6n8v1 wrote
Just apply an aggregation function on the 0th axis of the network output, so (18, 3, 90, 90) becomes (3, 90, 90). This can be sum, mean, min, max, whatever. If you know you will always have 18 images, sum is the best choice, since your loss function will naturally regularise the weights to be smaller and it's the easiest to differentiate. If the number of images will vary, use mean, so the scale of the output does not depend on how many images you have. Min and max are only piecewise differentiable (at each position the gradient reaches only the selected image), so they might give you problems.
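A minimal PyTorch sketch of that aggregation, in case it helps. The tiny `backbone` and the `reconstruct` helper are hypothetical stand-ins, not anything from the thread; any autoencoder or U-Net that maps (N, 3, H, W) to (N, 3, H, W) could take the backbone's place.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the autoencoder / U-Net (assumption):
# it maps (N, 3, H, W) -> (N, 3, H, W), treating the N images like a batch.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)

def reconstruct(images: torch.Tensor, reduce: str = "sum") -> torch.Tensor:
    """Map one sample of shape (N, 3, H, W) to a single (3, H, W) image."""
    per_image = backbone(images)       # (N, 3, H, W)
    if reduce == "sum":
        return per_image.sum(dim=0)    # aggregate over the 0th axis
    if reduce == "mean":
        return per_image.mean(dim=0)   # output scale independent of N
    raise ValueError(f"unknown reduction: {reduce}")

sample = torch.randn(18, 3, 90, 90)    # one sample made of 18 RGB images
out = reconstruct(sample)
print(out.shape)                       # torch.Size([3, 90, 90])
```

With a real batch of B samples, one option is to reshape (B, N, 3, H, W) to (B*N, 3, H, W), run the network, reshape back, and reduce over the N axis instead.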
If you use sum, make sure you do gradient clipping so the gradients don't explode in the beginning.
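A rough sketch of a training step with gradient clipping, assuming PyTorch. The small model, the MSE loss, the Adam optimizer and the `max_norm=1.0` threshold are all illustrative assumptions, not values from the thread; `torch.nn.utils.clip_grad_norm_` rescales the gradients in place when their total norm exceeds the threshold.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in network (assumption), as in any (N, 3, H, W) -> (N, 3, H, W) model.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = nn.MSELoss()                                     # assumed loss

def train_step(images, target, max_norm=1.0):
    """One update on a single sample: images (N, 3, H, W), target (3, H, W)."""
    optimizer.zero_grad()
    pred = model(images).sum(dim=0)        # sum-aggregate over the 0th axis
    loss = loss_fn(pred, target)
    loss.backward()
    # Clip the total gradient norm; the return value is the norm before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), float(grad_norm)

loss, grad_norm = train_step(torch.randn(18, 3, 90, 90), torch.randn(3, 90, 90))
print(f"loss={loss:.4f}, grad norm before clipping={grad_norm:.4f}")
```

Logging the pre-clipping norm for the first few hundred steps is one way to see whether the threshold is actually kicking in.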