Submitted by Blutorangensaft t3_121jylg in MachineLearning

One common technique to stabilise GAN training is to normalise the input data to some range, for example [-1, 1] (a quick sketch of what I mean is at the end of this post). I have three questions regarding this:

  1. Has there been a paper published on systematically investigating different normalisation schemes and their effect on convergence?

  2. Is the type of normalisation related to the activation function used in the GAN? For instance, I would imagine that [0, 1] works better with ReLU and [-1, 1] with sigmoid.

  3. After normalisation within [0, 1], my WGAN converges slowly but reliably to a critic loss of zero, starting from a high value. When I didn't use normalisation, the critic loss dropped from a high value to below zero, slowly approaching zero again from a negative value. Which behaviour is more desirable?
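For concreteness, here is a minimal sketch of the kind of normalisation I mean (assuming uint8, image-style inputs; my actual preprocessing may differ):

```python
import numpy as np

def to_unit_range(x):
    """Scale uint8 data from [0, 255] to [0, 1]."""
    return x.astype(np.float32) / 255.0

def to_symmetric_range(x):
    """Scale uint8 data from [0, 255] to [-1, 1]."""
    return x.astype(np.float32) / 127.5 - 1.0
```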

6

Comments


KarlKani44 t1_jdng4it wrote

  1. Normalization is much older than GANs and I don't think there are papers that investigate its effect specifically for GANs. To find papers that generally look into the effect of normalization, you would probably have to go back to work from the '90s that experimented with small MLPs and CNNs. Normalization just helps with convergence in general, which is often a problem with GANs.

  2. Normalization is not related to the activation function, since activations are applied after at least one linear layer, which usually includes a bias. This bias can easily shift your logits into any range, so the initial normalization doesn't have an effect on this. In my experience, a well-designed GAN will converge with [-1, 1] just as well as with [0, 1], making barely any difference. Just make sure you don't train with very large values (like [0, 255] for pixels). If my data is already in a range of something like [0, 1.5], I don't care about normalization that much.

  3. The WGAN critic loss starts at a large negative value and converges to zero from there. See the paper "Improved Training of Wasserstein GANs", Figure 5(a), where they plot the "negative critic loss". Depending on your batch size and steps per epoch, you might see a positive critic loss at the beginning, which quickly drops to a large negative value before it starts to converge to zero.

Usually you want your critic loss to converge to zero slowly. If it drops to zero very fast, training might still work, but your generated samples are probably not optimal. Generally I'd track the quality of the samples with an additional metric. In the case of images you can use something like FID. For simpler data (or simple images like MNIST) there are also metrics like MMD that give you an idea of your sample quality, which you can again use to improve your training.
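For reference, a minimal sketch of an RBF-kernel MMD estimate (the biased estimator, assuming two batches of flattened samples; the kernel bandwidth `sigma` is a hypothetical choice you'd want to tune):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate between two sample sets of shape (n, d)."""
    def k(a, b):
        # Gaussian kernel on pairwise squared distances
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()).item()
```

Lower is better; compare a batch of generated samples against a held-out real batch every few epochs to see whether quality actually improves.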

WGANs often work better if the discriminator is bigger than the generator (around 3x the parameters in my experience). If you think your networks are designed quite well already, the next thing I would play with is the number of critic updates done before each generator update. I've seen people go up to 25 with this number (the original paper uses 5, I think). The other hyperparameter I'd play with is the Adam learning rate, but usually keeping it the same for generator and critic.
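As a rough illustration of those two knobs (critic steps per generator step, one shared Adam learning rate), here is a minimal WGAN-style update loop; the tiny MLPs, batch size and random stand-in data are placeholders, not a recommendation:

```python
import torch
from torch import nn

latent_dim, data_dim, n_critic = 32, 64, 5  # n_critic: critic steps per generator step

generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))

# same Adam settings for both networks
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))

for step in range(1000):
    real = torch.randn(64, data_dim)                      # stand-in for a real batch
    fake = generator(torch.randn(64, latent_dim)).detach()

    # critic update: maximise critic(real) - critic(fake)
    loss_c = critic(fake).mean() - critic(real).mean()    # + gradient penalty in WGAN-GP
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # generator update only every n_critic critic steps
    if (step + 1) % n_critic == 0:
        loss_g = -critic(generator(torch.randn(64, latent_dim))).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```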

3

snylekkie t1_jdo56fy wrote

You seem knowledgeable. Do you work in ML ?

2

KarlKani44 t1_jdo96om wrote

Thanks! :) I've been working as a machine learning engineer for around 2 years at quite a big company (not FAANG). There's a good chance that you have used models that I trained! I also did my master's degree mostly in the field of image generation.

2

Blutorangensaft OP t1_jdnjfmu wrote

Thank you for the thorough answer. 1) I see, I will just trust my normalisation scheme then. 2) That makes sense. 3) Is the training curve you describe the only possible one for the critic loss? Because, with normalisation, I see the critic loss approaching 0 from a positive value. Could this mean that the generator's job became easier due to normalisation? Does it make sense to think about improving the critic then (like you described, with 3 times the params)? Also, I read about and tried scheduling, but I am using TTUR instead for its better convergence properties.

1

KarlKani44 t1_jdnosqr wrote

> Is the training curve you describe the only possible one for the critic loss?

Well, that's hard to say. If it works, I wouldn't call it wrong, but it would still make me think. Generally, in the case of WGAN, it's always a bit hard to say whether the problem is an overly strong generator or an overly strong discriminator. With normal GANs, you can see that the discriminator differentiates very easily when you look at its accuracy. With WGANs, you can look at the distribution of output logits from the critic for real and generated samples. If the distributions are easily separable, the discriminator is able to tell real from fake samples. During training, the distributions of output logits should converge to look the same for both sets.

From my experience and understanding: you want a very strong discriminator in WGAN training, since the gradient of its forward pass will still be very smooth because of the Lipschitz constraint (enforced through the gradient penalty). This is also why you train it multiple times before each generator update. You want it to be very strong so the generator can use it as guidance. In a vanilla GAN this would be a problem because the generator could not keep up. This is also why WGANs are easier to train: you don't have to maintain that hard-to-achieve balance between the two networks.
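For completeness, this is roughly what the WGAN-GP gradient penalty looks like (a sketch assuming a PyTorch critic and batched tensors; the penalty weight, typically 10, is applied outside this function):

```python
import torch

def gradient_penalty(critic, real, fake):
    """Push the critic's gradient norm towards 1 on points interpolated
    between real and fake samples (the WGAN-GP Lipschitz penalty)."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```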

If you look at the Keras tutorial on WGAN-GP, their critic has 4.3M parameters, while the generator only has 900k. A vanilla GAN would not converge with models like this because the discriminator would be too strong. Their critic loss also starts at around -7 and goes down very smoothly from there.
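If you want to check that kind of size ratio for your own networks, counting trainable parameters is a one-liner (assuming PyTorch modules):

```python
def n_params(model):
    """Number of trainable parameters in a torch.nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. aim for n_params(critic) to be a few times n_params(generator)
```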

> Could this mean that the generator's job became easier due to normalisation?

I would agree with this hypothesis. I'd say your critic is not able to properly tell the real samples from the generated ones right at the beginning; the normalization probably helped the generator more than the critic. Try to make the critic stronger by scaling up the network or training it more often before each generator update, and see if the critic loss then starts at negative values. Also try the aforementioned plot of the critic's output logits to see whether the critic can separate real from fake in early epochs.

I haven't used scheduling with GANs before, but it might help. I would still try to get a stable training run with nice-looking output first and only then try more tricks like scheduling and TTUR. With Adam I usually don't do any tricks like this, though.

2

Blutorangensaft OP t1_jdnxm94 wrote

I see, I will improve my critic then (maybe give it more depth) and abstain from tricks like TTUR for now.

What do you mean by "easily separable distribution of output logits" btw? Plotting the scores the critic assigns to real and fake samples separately? Or taking the mean and standard deviation of the logits for real and fake data and comparing those?

1

KarlKani44 t1_jdnzo65 wrote

> Plotting the scores the critic assigns to real and fake samples separately? Or taking the mean and standard deviation of the logits for real and fake data and comparing those?

Both of those work. I like to plot the critic outputs for real samples as a histogram and then do the same for generated samples. This shows you how well your critic does at separating real from fake samples. You can do this every few epochs during training. You should see that in early epochs the two histograms barely overlap, and over the course of training they get closer to each other.

It might look like this: https://imgur.com/a/OknV5l0

The left plot is from early training; the right is after some epochs, when the critic has partially converged. At the end they will overlap almost completely.
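If you want to reproduce plots like that, here is a minimal sketch with matplotlib (assuming a PyTorch critic and one batch each of real and generated samples):

```python
import matplotlib.pyplot as plt
import torch

@torch.no_grad()
def plot_critic_logits(critic, real_batch, fake_batch):
    """Histogram of critic outputs for real vs. generated samples.
    Heavy overlap means the critic can no longer tell them apart."""
    real_scores = critic(real_batch).flatten().cpu().numpy()
    fake_scores = critic(fake_batch).flatten().cpu().numpy()
    plt.hist(real_scores, bins=50, alpha=0.5, label="real")
    plt.hist(fake_scores, bins=50, alpha=0.5, label="generated")
    plt.xlabel("critic output")
    plt.legend()
    plt.show()
```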

2

Blutorangensaft OP t1_jdnzzx1 wrote

Love the visualisation, I will definitely do that. Thanks so much for answering all my questions.

1