I don’t want to mix the individual information contained in the 18 RGB images together in the hope that the network learns the anticipated features from them.
I have been running some experiments on toy datasets (MNIST, CIFAR-10), and for now it seems that almost none of the latent variables z, as measured by their mu and logvar vectors, are ever close to 0 (i.e. collapsed to the standard Gaussian). Mathematically this makes sense, since every latent variable will learn at least some information that is not garbage (standard Gaussian). So deciding the optimal latent space dimensionality is still eluding me.
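One way to make the "is this latent variable close to 0" check concrete is to compute the KL divergence of each dimension from the standard Gaussian and average it over a batch. The snippet below is only a minimal sketch under assumptions: it assumes a diagonal-Gaussian encoder, the function names are hypothetical, the toy mu/logvar values are made up for illustration, and the 0.01 threshold is an arbitrary choice rather than anything from my experiments.

```python
import math

def kl_per_dimension(mu, logvar):
    """KL( N(mu_j, exp(logvar_j)) || N(0, 1) ) for each latent dimension j."""
    return [0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, logvar)]

def mean_kl_over_batch(mus, logvars):
    """Average the per-dimension KL over a batch of encoder outputs."""
    n = len(mus)
    totals = [0.0] * len(mus[0])
    for mu, logvar in zip(mus, logvars):
        for j, kl in enumerate(kl_per_dimension(mu, logvar)):
            totals[j] += kl
    return [t / n for t in totals]

def active_dimensions(mus, logvars, threshold=0.01):
    """Indices whose batch-averaged KL exceeds the (assumed) threshold."""
    mean_kl = mean_kl_over_batch(mus, logvars)
    return [j for j, kl in enumerate(mean_kl) if kl > threshold]

# Made-up encoder outputs: 3 samples, 4 latent dimensions.
# Dimensions 1 and 3 sit (almost) exactly on the standard Gaussian.
mus = [[1.5, 0.0, -0.8, 0.01], [-1.2, 0.0, 0.9, -0.02], [0.7, 0.0, -1.1, 0.00]]
logvars = [[-2.0, 0.0, -1.0, 0.0], [-2.2, 0.0, -1.3, 0.01], [-1.8, 0.0, -0.9, 0.0]]

active = active_dimensions(mus, logvars)
print(active)
```

With a real model, mus and logvars would be the encoder outputs over a validation set. A dimension whose averaged KL stays near 0 is one the encoder is effectively not using.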
Retraining the model with reduced dimensions would be a *rough* way of _proving_ this. But the stochastic behavior of neural networks makes this hard to achieve.
I don’t think that the Gaussians are being output by a layer. In contrast with an Autoencoder, where a sample is encoded to a single point, in a VAE, due to the Gaussian prior, a sample is now encoded as a Gaussian distribution. This is the regularisation effect which enforces this distribution in the latent space. It cuts both ways, meaning that if the true manifold is not Gaussian, we still assume and therefore force it to be Gaussian.
A Gaussian signal being meaningful is something that I wouldn’t count on. Diffusion models are a stark contrast, but we aren’t talking about them. The farther a signal is away from a standard Gaussian, the more information it’s trying to smuggle through the bottleneck.
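To spell out what "away from a standard Gaussian" means per latent dimension, here is the usual closed-form KL term, assuming a diagonal-Gaussian encoder with mean mu_j and log-variance logvar_j = log sigma_j^2 (that assumption is mine, it is just the common VAE setup):

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_j, \sigma_j^2)\,\|\,\mathcal{N}(0, 1)\right)
  = \tfrac{1}{2}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```

This is 0 only when mu_j = 0 and sigma_j^2 = 1 (logvar_j = 0), and it grows as the encoded distribution moves away from that point.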
I didn’t get your point about looking at the decoder weights to figure out whether they are contributing. Do you compare them to their randomly initialized values to infer this?
For a given dataset (say, CIFAR-10), if you intentionally keep the latent space dimensionality large, say 1000-d, I am assuming that during learning the model will automatically not use the dimensions it doesn't need to optimize the reconstruction and KL-divergence losses. Consequently, these variables will be exactly, or very close to, standard Gaussian distributions. Is my hand-wavy thought correct? And if yes, are there any research papers that prove this?
grid_world OP t1_jc6ox1c wrote
Reply to comment by notgettingfined in Image reconstruction by grid_world
Maybe a conv3d would work better, without having to reshape.