This paper first answers the question "why do the two most powerful techniques, Dropout and Batch Normalization (BN), often lead to worse performance when they are combined?" from both theoretical and statistical aspects. Theoretically, we find that Dropout shifts the variance of a specific neural unit when the state of the network is transferred from train to test. BN, however, maintains its statistical variance, accumulated over the entire learning procedure, in the test phase. The inconsistency of that variance (which we term "variance shift") causes unstable numerical behavior at inference and ultimately leads to more erroneous predictions when Dropout is applied before BN. Thorough experiments on DenseNet, ResNet, ResNeXt and Wide ResNet confirm our findings. Based on the uncovered mechanism, we further explore several strategies that modify Dropout and try to overcome the limitations of the combination by avoiding the variance shift risks.
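To make the variance shift concrete, here is a minimal numerical sketch (in NumPy; not code from the paper, and the keep ratio p = 0.5 and unit-Gaussian input are illustrative assumptions). At train time, inverted Dropout scales surviving activations by 1/p, so the variance a downstream BN layer accumulates is roughly 1/p times the variance the same unit exhibits at test time, when Dropout acts as the identity.

```python
# Sketch of "variance shift": the variance BN sees in training
# (with inverted Dropout active) vs. the variance of the same
# unit at test time (Dropout disabled). Hypothetical setup.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # keep probability (assumed)
x = rng.normal(size=1_000_000)        # zero-mean unit feeding into BN

# Train phase: inverted Dropout output = mask * x / p
mask = rng.random(x.shape) < p
train_out = mask * x / p
print("train-time variance seen by BN:", train_out.var())  # ~ 1/p = 2.0

# Test phase: Dropout is the identity, but BN still normalizes
# with the moving variance it accumulated during training.
test_out = x
print("test-time variance of the unit:", test_out.var())   # ~ 1.0
```

Under these assumptions the ratio between the two measured variances is about 1/p, which is the mismatch BN's stored statistics cannot absorb at inference.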