BrohammerOK t1_j2ekoyj wrote
If you do use both in the same block, dropout should never be applied right before batch or layer norm, because the features zeroed out by dropout would skew the mean and variance calculations. As an example, it is common in CNNs to use batch norm inside the conv blocks and apply dropout only after the global average pooling, right before the final fc layer (sketched below). Sometimes you even see dropout between conv blocks; take a look at EfficientNet by Google.
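Here is a minimal PyTorch sketch of that ordering. The layer sizes, dropout rate, and module names are illustrative assumptions, not EfficientNet's actual configuration; the point is just that batch norm sits inside the conv blocks while dropout comes after global average pooling, so zeroed activations never feed a normalization layer.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN (assumed architecture) showing the BN-then-dropout ordering."""

    def __init__(self, num_classes: int = 10, p_drop: float = 0.2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(32),   # normalizes full, un-dropped activations
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.dropout = nn.Dropout(p_drop)    # dropout after GAP, before the fc head
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)            # all batch norms happen here
        x = self.pool(x).flatten(1)     # (N, 64)
        x = self.dropout(x)             # zeros here never reach a norm layer
        return self.fc(x)

model = SmallCNN()
logits = model(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```

If you flipped the order and put dropout before a BatchNorm2d, the batch statistics would be computed over activations where a random subset is exactly 0, shifting the mean toward 0 and inflating the variance relative to what the layer sees at inference time.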