Submitted by Blutorangensaft t3_11wmpoj in MachineLearning
I am working with ResNets built from feedforward (fully connected) layers, with Kaiming-He weight initialisation and ReLU activations. Extending the network beyond 10 layers leads to vanishing gradients. I cannot use batch normalization because it would violate the assumptions of a gradient penalty. What should I do? Should I form residual connections that skip over longer spans? Should I implement artificial derivatives? What's the common remedy here? For concreteness, a minimal sketch of the kind of block I mean follows (PyTorch, dimensions and names illustrative, not my actual code):
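```python
import torch
import torch.nn as nn

class ResidualFFBlock(nn.Module):
    """Plain residual block built from two linear layers, no normalization."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        # Kaiming-He initialisation, matched to the ReLU nonlinearity
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")
            nn.init.zeros_(fc.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.fc1(x))
        out = self.fc2(out)
        return torch.relu(x + out)  # identity shortcut around the two layers
```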
IntelArtiGen t1_jcyqmdu wrote
Do you use another kind of normalization? You can try InstanceNorm / LayerNorm if you can't use batchnorm.
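A minimal sketch of what that could look like (PyTorch assumed, class and layer names illustrative). Unlike BatchNorm, LayerNorm normalizes each sample independently, so it doesn't mix statistics across the batch in a way a per-sample gradient penalty would have to account for:

```python
import torch
import torch.nn as nn

class LayerNormResidualBlock(nn.Module):
    """Residual feedforward block using LayerNorm instead of BatchNorm."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")
            nn.init.zeros_(fc.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activation style: normalise, activate, then apply the linear layer
        out = self.fc1(torch.relu(self.norm1(x)))
        out = self.fc2(torch.relu(self.norm2(out)))
        return x + out  # identity shortcut keeps the gradient path short
```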