Submitted by Blutorangensaft t3_11qejcz in MachineLearning
What is the current state-of-the-art when it comes to the generalisation ability of autoencoders? I have been working with text autoencoders for some time and, although they work well on the training data, they generalise very poorly to unseen sentences (as, for example, noted here: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=there+and+back+again+autoencoder&btnG=#d=gs_qabs&t=1678725350369&u=%23p%3DksKOTTf1c1IJ). How do image autoencoders do with unseen images? What research efforts are underway to improve generalisation ability?
currentscurrents t1_jc31c23 wrote
Vanilla autoencoders don't generalize well, but variational autoencoders generalize much better: the KL term regularizes the latent space toward a smooth prior, so the space is much better structured and nearby latent points decode to similar outputs.
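For concreteness, here's a minimal numpy sketch of the VAE objective (toy numbers, not a trained model). The closed-form KL term for diagonal Gaussians is what pulls the latent codes toward N(0, I):

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: squared error between input and decoder output.
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL divergence between N(mu, exp(log_var)) and N(0, I)
    # for diagonal Gaussians; this is the term that structures the space.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

# Reparameterisation trick: sample z = mu + sigma * eps so gradients
# can flow through the sampling step during training.
mu = np.array([0.1, -0.2])
log_var = np.array([-1.0, -1.0])
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps
```

Note the KL penalty is zero exactly when the posterior already is the standard normal (mu = 0, log_var = 0), which is why the encoder can't scatter codes arbitrarily the way a vanilla autoencoder can.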
Generalization really comes down to inductive biases. Autoencoders are downscalers -> upscalers, so they have an inductive bias towards preserving large features in the data and discarding small details. This is reasonable for images but not so much for text.
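That bias is easy to see in a toy numpy experiment (no learning involved; pooling stands in for the encoder/decoder): a low-frequency signal survives a downsample/upsample round trip, a high-frequency one doesn't.

```python
import numpy as np

def down_up(x, factor=4):
    # Downscale: average-pool, keeping one value per block of `factor` samples.
    pooled = x.reshape(-1, factor).mean(axis=1)
    # Upscale: nearest-neighbour repeat back to the original length.
    return np.repeat(pooled, factor)

t = np.linspace(0, 1, 64, endpoint=False)
smooth = np.sin(2 * np.pi * t)             # one large feature
detail = 0.5 * np.sin(2 * np.pi * 16 * t)  # fine, high-frequency detail

err_smooth = np.abs(smooth - down_up(smooth)).mean()
err_detail = np.abs(detail - down_up(detail)).mean()
# The smooth signal reconstructs far better; the detail averages away.
```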
But autoencoders are just one example of an information bottleneck model, which includes everything from autoregressive language models to diffusion models to U-Nets. (U-Nets are basically just autoencoders with skip connections!) They all throw away part of the data and learn how to reconstruct it.
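To make the skip-connection point concrete, here's a toy numpy sketch (an additive skip, not the learned concatenation a real U-Net uses): the fine detail the bottleneck throws away is routed around it and re-injected on the way up.

```python
import numpy as np

def bottleneck(x, factor=4):
    # Average-pool down, then repeat back up: a lossy bottleneck.
    return np.repeat(x.reshape(-1, factor).mean(axis=1), factor)

def with_skip(x, factor=4):
    coarse = bottleneck(x, factor)
    skip = x - coarse      # the fine detail the bottleneck discarded
    return coarse + skip   # the "decoder" adds it back

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.3 * rng.standard_normal(64)

err_plain = np.abs(x - bottleneck(x)).mean()
err_skip = np.abs(x - with_skip(x)).mean()  # near zero, by construction
```

The toy version makes the skip trivially perfect; in a real U-Net the skip features are concatenated and the decoder learns how much of them to use.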
Different kinds of bottlenecks have different inductive biases and are better suited to different kinds of data. Next-word prediction seems better suited to text because it mirrors the natural left-to-right flow of language.
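As a toy illustration of the next-word-prediction bottleneck (made-up corpus, with bigram counts standing in for a real language model): the model compresses the text into per-word continuation statistics and reconstructs by predicting what comes next.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# "Train": count which word follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation seen in training; None for unseen words.
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]
```

Even this crude model reflects the left-to-right structure of language in a way a downscale/upscale bottleneck doesn't.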