Submitted by Blutorangensaft t3_121jylg in MachineLearning
One common technique to stabilise GAN training is to normalise the input data in some range, for example [-1, 1]. I have three questions regarding this:
- Has there been a paper published that systematically investigates different normalisation schemes and their effect on convergence?
- Is the type of normalisation related to the activation function used in the GAN? For instance, I would imagine that [0, 1] works better with ReLU and [-1, 1] with sigmoid.
- After normalisation to [0, 1], my WGAN converges slowly but reliably to a critic loss of zero, starting from a high value. When I didn't use normalisation, the critic loss dropped from a high value to below zero, then slowly approached zero again from the negative side. Which behaviour is more desirable?
KarlKani44 t1_jdng4it wrote
Normalization is much older than GANs, and I don't think there are papers that investigate its effect specifically for GANs. To find papers that look into the effect of normalization in general, you would probably have to go back to work from the '90s that experimented with small MLPs and CNNs. Normalization simply helps with convergence in general, and convergence is often exactly what's problematic with GANs.
Normalization is not tied to the activation function, since activations are applied after at least one linear layer, which usually includes a bias. That bias can easily shift your logits into any range, so the initial normalization has no effect on this. In my experience, a well-designed GAN converges with [-1, 1] just as well as with [0, 1]; it makes barely any difference. Just make sure you don't train on very large values (like [0, 255] for pixels). If my data is already in a range like [0, 1.5], I don't worry about normalization much.
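For concreteness, a minimal sketch of the two common scalings, assuming uint8 pixel data in [0, 255] (shapes and names are just placeholders):

```python
import numpy as np

# placeholder batch of uint8 images in [0, 255]
images = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)

x_01 = images.astype(np.float32) / 255.0         # scale to [0, 1]
x_11 = images.astype(np.float32) / 127.5 - 1.0   # scale to [-1, 1]
```

Either version trains fine in my experience, as long as you don't feed the raw [0, 255] values.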
The WGAN critic loss starts at a large negative value and converges to zero from there. See Figure 5(a) of the paper "Improved Training of Wasserstein GANs", where they plot the negative critic loss. Depending on your batch size and steps per epoch, you might see a positive critic loss at the very beginning that quickly drops to a large negative value before it starts converging towards zero.
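For reference, the critic objective from that paper (WGAN-GP) looks roughly like this in PyTorch. This is only a sketch, assuming a critic that takes batched tensors; the function name and the gp_weight default (lambda = 10 in the paper) are mine:

```python
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein estimate E[D(fake)] - E[D(real)]; this is the term that converges towards zero
    wasserstein = critic(fake).mean() - critic(real).mean()

    # gradient penalty from "Improved Training of Wasserstein GANs"
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return wasserstein + gp_weight * penalty
```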
Usually you want your critic loss to converge to zero slowly. If it drops to zero very fast, training might still work, but your generated samples are probably not optimal. Generally I'd track sample quality with an additional metric. For images you can use something like FID. For simpler data (or simple images like MNIST) there are also metrics like MMD that give you an idea of sample quality, which you can then use to guide your training.
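MMD is also easy to roll yourself if your samples are small enough to compare directly in pixel or feature space. A minimal sketch of a (biased) MMD² estimate with an RBF kernel, assuming flattened samples as (n, d) tensors; the function name and sigma are arbitrary choices:

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two sample batches x, y of shape (n, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

Lower is better; track it on a fixed set of generated samples every few epochs alongside the critic loss.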
WGANs often work better if the critic is bigger than the generator (around 3x the parameters in my experience). If you think your networks are already well designed, the next thing I would play with is the number of critic updates per generator update. I've seen people go up to 25 with this (the original paper uses 5, I think). The other hyperparameter I'd tune is the Adam learning rate, usually keeping it the same for generator and critic.
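Put together, the update schedule looks something like the sketch below. The toy MLPs, data, and loop length are placeholders; the learning rate and Adam betas are the values from the WGAN-GP paper:

```python
import torch
from torch import nn

latent_dim, data_dim, batch_size, n_critic = 16, 64, 32, 5   # n_critic=5 as in the original paper

# toy stand-ins; real models would be CNNs, with the critic ~3x the generator's parameters
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))

gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))

for step in range(1000):
    real = torch.randn(batch_size, data_dim)       # placeholder for a real, normalized data batch

    # several critic updates per generator update
    for _ in range(n_critic):
        critic_opt.zero_grad()
        fake = generator(torch.randn(batch_size, latent_dim)).detach()
        # plain Wasserstein estimate; add a gradient penalty (as sketched above) for WGAN-GP
        c_loss = critic(fake).mean() - critic(real).mean()
        c_loss.backward()
        critic_opt.step()

    gen_opt.zero_grad()
    fake = generator(torch.randn(batch_size, latent_dim))
    g_loss = -critic(fake).mean()                  # generator wants the critic score to be high
    g_loss.backward()
    gen_opt.step()
```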