Neural networks are commonly initialized so that the theoretical variance of the hidden pre-activations stays constant, in order to avoid the vanishing and exploding gradient problem. Although this condition is necessary to train very deep networks, numerous analyses have shown that it is not sufficient. We explain this fact by analyzing the behavior of the empirical variance, which is more meaningful in practice for data sets of finite size, and show that its discrepancy with the theoretical variance grows with depth. We study the output distribution of neural networks at initialization in terms of its kurtosis, which we find grows to infinity with increasing depth even when the theoretical variance stays constant. As a consequence, the empirical variance vanishes: its asymptotic distribution converges in probability to zero. Our analysis, which traces this effect to the increasing dependence between outputs, focuses on fully-connected ReLU networks with He initialization, but we hypothesize that many more random weight initialization methods suffer from either vanishing or exploding empirical variance. We support this hypothesis experimentally and demonstrate the failure of state-of-the-art random initialization methods in very deep regimes.
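The phenomenon can be observed in a simple simulation. The following sketch (not taken from the paper; width, depth, and batch size are illustrative choices) propagates a finite batch of inputs through a fully-connected ReLU network with He initialization and prints the empirical variance of the pre-activations across the batch, which shrinks with depth even though the theoretical per-layer variance is constant.

```python
# Illustrative sketch (assumed setup, not the paper's experiment):
# deep fully-connected ReLU network with He initialization,
# tracking the empirical variance over a finite batch of inputs.
import numpy as np

rng = np.random.default_rng(0)

width = 256        # hidden width (illustrative)
depth = 200        # number of layers (illustrative)
batch_size = 64    # finite data set of inputs

# Finite sample of inputs with unit variance per coordinate.
x = rng.standard_normal((batch_size, width))

h = x
for layer in range(1, depth + 1):
    # He initialization: Var(W_ij) = 2 / fan_in keeps the *theoretical*
    # pre-activation variance constant under ReLU.
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    z = h @ W.T
    h = np.maximum(z, 0.0)  # ReLU

    if layer % 50 == 0:
        # Empirical variance across the batch, averaged over units.
        emp_var = z.var(axis=0).mean()
        print(f"depth {layer:4d}: mean empirical variance = {emp_var:.4f}")
```

For a single draw of the weights, the printed values decay with depth, consistent with the claim that the empirical variance over a finite data set vanishes even when the theoretical variance is preserved by the initialization.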