Submitted by Blutorangensaft t3_11qejcz in MachineLearning
What is the current state-of-the-art when it comes to the generalisation ability of autoencoders? I have been working with text autoencoders for some time and, although they work well on the training data, they generalise very poorly to unseen sentences (as, for example, noted here: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=there+and+back+again+autoencoder&btnG=#d=gs_qabs&t=1678725350369&u=%23p%3DksKOTTf1c1IJ). How do image autoencoders do with unseen images? What research efforts are underway to improve generalisation ability?
currentscurrents t1_jc31c23 wrote
Vanilla autoencoders don't generalize well, but variational autoencoders generalize much better: the KL term regularizes the latent space toward a smooth prior, so the space is much better structured and nearby latent points decode to similar outputs.
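For concreteness, here's a minimal numpy sketch of the VAE objective (toy numbers, not a trained model). The closed-form KL term for diagonal Gaussians is what pulls the latent codes toward N(0, I):

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: squared error between input and decoder output.
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL divergence between N(mu, exp(log_var)) and N(0, I)
    # for diagonal Gaussians; this is the term that structures the space.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

# Reparameterisation trick: sample z = mu + sigma * eps so gradients
# can flow through the sampling step during training.
mu = np.array([0.1, -0.2])
log_var = np.array([-1.0, -1.0])
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps
```

Note the KL penalty is zero exactly when the posterior already is the standard normal (mu = 0, log_var = 0), which is why the encoder can't scatter codes arbitrarily the way a vanilla autoencoder can.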
Generalization really comes down to inductive biases. Autoencoders are downscalers -> upscalers, so they have an inductive bias towards preserving large features in the data and discarding small details. This is reasonable for images but not so much for text.
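That bias is easy to see in a toy numpy experiment (no learning involved; pooling stands in for the encoder/decoder): a low-frequency signal survives a downsample/upsample round trip, a high-frequency one doesn't.

```python
import numpy as np

def down_up(x, factor=4):
    # Downscale: average-pool, keeping one value per block of `factor` samples.
    pooled = x.reshape(-1, factor).mean(axis=1)
    # Upscale: nearest-neighbour repeat back to the original length.
    return np.repeat(pooled, factor)

t = np.linspace(0, 1, 64, endpoint=False)
smooth = np.sin(2 * np.pi * t)             # one large feature
detail = 0.5 * np.sin(2 * np.pi * 16 * t)  # fine, high-frequency detail

err_smooth = np.abs(smooth - down_up(smooth)).mean()
err_detail = np.abs(detail - down_up(detail)).mean()
# The smooth signal reconstructs far better; the detail averages away.
```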
But autoencoders are just one example of an information bottleneck model, which includes everything from autoregressive language models to diffusion models to U-Nets. (U-Nets are basically just autoencoders with skip connections!) They all throw away part of the data and learn how to reconstruct it.
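To make the skip-connection point concrete, here's a toy numpy sketch (an additive skip, not the learned concatenation a real U-Net uses): the fine detail the bottleneck throws away is routed around it and re-injected on the way up.

```python
import numpy as np

def bottleneck(x, factor=4):
    # Average-pool down, then repeat back up: a lossy bottleneck.
    return np.repeat(x.reshape(-1, factor).mean(axis=1), factor)

def with_skip(x, factor=4):
    coarse = bottleneck(x, factor)
    skip = x - coarse      # the fine detail the bottleneck discarded
    return coarse + skip   # the "decoder" adds it back

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.3 * rng.standard_normal(64)

err_plain = np.abs(x - bottleneck(x)).mean()
err_skip = np.abs(x - with_skip(x)).mean()  # near zero, by construction
```

The toy version makes the skip trivially perfect; in a real U-Net the skip features are concatenated and the decoder learns how much of them to use.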
Different kinds of bottlenecks have different inductive biases and are better suited to different kinds of data. Next-word prediction seems better suited to text because it mirrors the natural left-to-right flow of language.
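As a toy illustration of the next-word-prediction bottleneck (made-up corpus, with bigram counts standing in for a real language model): the model compresses the text into per-word continuation statistics and reconstructs by predicting what comes next.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# "Train": count which word follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation seen in training; None for unseen words.
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]
```

Even this crude model reflects the left-to-right structure of language in a way a downscale/upscale bottleneck doesn't.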