Submitted by Blutorangensaft t3_11wmpoj in MachineLearning
I am working with ResNets built from feedforward (fully connected) layers, with Kaiming-He weight initialisation and ReLU activations. Extending the network beyond 10 layers leads to vanishing gradients. I cannot use batch normalization because it would violate the assumptions of a gradient penalty. What should I do? Should I form residual connections that skip over longer spans? Should I implement artificial derivatives? What's the common remedy here? For concreteness, a minimal sketch of the kind of block I mean follows (PyTorch, dimensions and names illustrative, not my actual code):
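```python
import torch
import torch.nn as nn

class ResidualFFBlock(nn.Module):
    """Plain residual block built from two linear layers, no normalization."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        # Kaiming-He initialisation, matched to the ReLU nonlinearity
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")
            nn.init.zeros_(fc.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.fc1(x))
        out = self.fc2(out)
        return torch.relu(x + out)  # identity shortcut around the two layers
```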
IntelArtiGen t1_jcyqmdu wrote
Do you use another kind of normalization? You can try InstanceNorm / LayerNorm if you can't use batchnorm.
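A minimal sketch of what that could look like (PyTorch assumed, class and layer names illustrative). Unlike BatchNorm, LayerNorm normalizes each sample independently, so it doesn't mix statistics across the batch in a way a per-sample gradient penalty would have to account for:

```python
import torch
import torch.nn as nn

class LayerNormResidualBlock(nn.Module):
    """Residual feedforward block using LayerNorm instead of BatchNorm."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")
            nn.init.zeros_(fc.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activation style: normalise, activate, then apply the linear layer
        out = self.fc1(torch.relu(self.norm1(x)))
        out = self.fc2(torch.relu(self.norm2(out)))
        return x + out  # identity shortcut keeps the gradient path short
```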