2018
DOI: 10.48550/arxiv.1806.05393
Preprint

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Abstract: In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary…
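The mechanism behind the title claim is an initialization scheme rather than an architectural change: the paper's Delta-Orthogonal initializer places a random orthogonal matrix at the spatial center of an otherwise zero convolution kernel, so each layer is norm-preserving at initialization. Below is a minimal NumPy sketch of that construction (an illustration, not the authors' released code; the helper name delta_orthogonal_kernel is chosen here for clarity).

```python
import numpy as np

def delta_orthogonal_kernel(k, c_in, c_out, seed=None):
    """Return a (k, k, c_in, c_out) conv kernel that acts as an orthogonal map."""
    assert c_out >= c_in, "the construction assumes the channel count does not shrink"
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    q *= np.sign(np.diag(r))            # fix the sign ambiguity of QR
    ortho = q[:, :c_in]                 # (c_out, c_in), orthonormal columns
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = ortho.T    # orthogonal block at the spatial center, zeros elsewhere
    return kernel

# Example: a 3x3 kernel mapping 64 -> 64 channels; the centered slice is orthogonal.
w = delta_orthogonal_kernel(3, 64, 64, seed=0)
print(w.shape, np.allclose(w[1, 1] @ w[1, 1].T, np.eye(64)))
```

With c_out = c_in the centered block is square and exactly orthogonal; for c_out > c_in the sketch uses orthonormal columns so the map remains norm-preserving.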

Cited by 51 publications (62 citation statements)
References 12 publications
“…One must adapt the architecture and the optimization procedure to train them correctly. Some approaches focus on initialization schemes [24,27,70], others on multi-stage training [53,56], multiple losses at different depths [61], adding components to the architecture [2,75], or regularization [33]. As pointed out in our paper, in that respect our LayerScale approach is most closely related to Rezero [2], Skipinit [16], Fixup [75], and T-Fixup [34].…”
Section: Related Work (mentioning)
confidence: 99%
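To make the comparison in this excerpt concrete, here is a small PyTorch-style sketch (an illustration only; ScaledResidual is a hypothetical module name) of the residual-branch scaling shared by ReZero, SkipInit and LayerScale: the residual branch is multiplied by a learnable factor initialized at or near zero, so a very deep stack starts out close to the identity map and trains stably.

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """y = x + scale * block(x), with `scale` learnable and initialized near zero."""

    def __init__(self, block: nn.Module, dim: int, init: float = 0.0, per_channel: bool = True):
        super().__init__()
        self.block = block
        # ReZero uses a single scalar initialized at 0; LayerScale uses a
        # per-channel vector initialized to a small constant (e.g. 1e-4).
        shape = (dim,) if per_channel else (1,)
        self.scale = nn.Parameter(torch.full(shape, init))

    def forward(self, x):
        return x + self.scale * self.block(x)

# Example: the wrapped layer is exactly the identity map at initialization.
layer = ScaledResidual(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)), dim=64)
out = layer(torch.randn(8, 64))
```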
“…We study the effect of the initialization hyperparameters on signal propagation for a very broad class of recurrent architectures, which includes as special cases many state-of-the-art RNN cells, including the GRU (Cho et al., 2014), the LSTM (Hochreiter and Schmidhuber, 1997), and the peephole LSTM (Gers et al., 2002). The analysis is based on the mean field theory of signal propagation developed in a line of prior work (Schoenholz et al., 2016; Xiao et al., 2018; Chen et al., 2018; Yang et al., 2019), as well as the concept of dynamical isometry (Saxe et al., 2013; Pennington et al., 2017) that is necessary for stable gradient backpropagation and which was shown to be crucial for training simpler RNN architectures (Chen et al., 2018). We perform a number of experiments to corroborate the results of the calculations and use them to motivate initialization schemes that outperform standard initialization approaches on a number of long sequence tasks.…”
Section: Introduction (mentioning)
confidence: 99%
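The "dynamical isometry" condition referenced here concerns the singular values of the end-to-end input-output Jacobian. A small numerical sketch (an illustration, not code from the cited papers) for the simplest case, a deep linear network: orthogonal weights keep every singular value of the depth-L product exactly 1, whereas i.i.d. Gaussian weights of matched scale let the spectrum spread over many orders of magnitude; the cited works extend this picture to nonlinear and recurrent networks initialized at criticality.

```python
import numpy as np

def end_to_end_singular_values(depth=100, width=256, orthogonal=True, seed=0):
    rng = np.random.default_rng(seed)
    jac = np.eye(width)                      # running product W_depth @ ... @ W_1
    for _ in range(depth):
        g = rng.standard_normal((width, width))
        if orthogonal:
            w, r = np.linalg.qr(g)
            w *= np.sign(np.diag(r))         # Haar-distributed orthogonal matrix
        else:
            w = g / np.sqrt(width)           # i.i.d. Gaussian entries, variance 1/width
        jac = w @ jac
    return np.linalg.svd(jac, compute_uv=False)

for orthogonal in (True, False):
    s = end_to_end_singular_values(orthogonal=orthogonal)
    print("orthogonal" if orthogonal else "gaussian  ", f"max={s.max():.3g}  min={s.min():.3g}")
```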
“…Turning to the titular edge of chaos, we are inspired by the aforementioned works [15, 16, 27-30] examining criticality in various deep network architectures. However, while many of these papers used the phrase "mean-field theory", they did not actually rely on any MFT analysis: as mentioned above, Gaussianity arises simply as a consequence of the central limit theorem (CLT).…”
Section: Relation to Other Work (mentioning)
confidence: 99%
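For reference, the "criticality" these works examine is usually phrased through the mean-field variance map and the gradient susceptibility chi. A minimal Monte-Carlo sketch is given below, assuming the standard fully-connected tanh setup of this line of work rather than the specific architectures of [15, 16, 27-30]: chi < 1 marks the ordered phase, chi > 1 the chaotic phase, and chi = 1 the order-to-chaos boundary, the edge of chaos.

```python
import numpy as np

Z = np.random.default_rng(0).standard_normal(200_000)   # Monte-Carlo samples of z ~ N(0, 1)

def q_fixed_point(sigma_w, sigma_b, iters=200, q0=1.0):
    """Iterate q <- sigma_w^2 E[tanh(sqrt(q) z)^2] + sigma_b^2 to its fixed point q*."""
    q = q0
    for _ in range(iters):
        q = sigma_w**2 * np.mean(np.tanh(np.sqrt(q) * Z) ** 2) + sigma_b**2
    return q

def chi(sigma_w, sigma_b):
    """Gradient susceptibility chi = sigma_w^2 E[tanh'(sqrt(q*) z)^2] at the fixed point."""
    q = q_fixed_point(sigma_w, sigma_b)
    return sigma_w**2 * np.mean((1.0 - np.tanh(np.sqrt(q) * Z) ** 2) ** 2)

# Scan sigma_w at fixed sigma_b to locate the order-to-chaos transition (chi crossing 1).
for sw in (1.0, 1.3, 1.6, 2.0):
    print(f"sigma_w = {sw:.1f}  chi = {chi(sw, sigma_b=0.05):.3f}")
```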
“…Before turning to perturbative QFT in the next section, we note that in [29] it was reported that in the case of RNNs, the injection of time-series data x destroys the ordered phase, and consequently there is no order-to-chaos phase transition. This arises due to an extra factor that appears in their analogue of (3.39) containing possible correlations in x.…”
Section: The Largest Lyapunov Exponent (mentioning)
confidence: 99%
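The Lyapunov exponent named in the section title can be estimated numerically with a standard two-trajectory (Benettin-style) procedure; the sketch below (an illustration, not the cited paper's analytic calculation; largest_lyapunov is a hypothetical helper) drives two copies of an untrained vanilla tanh RNN with the same i.i.d. input sequence and measures the average exponential growth rate of their separation. A negative value indicates the ordered phase, a positive value the chaotic phase.

```python
import numpy as np

def largest_lyapunov(sigma_w, sigma_u=1.0, width=512, steps=2000, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)   # recurrent weights
    u = rng.standard_normal(width)                                       # input weights (scalar input)
    h1 = np.zeros(width)
    h2 = h1 + eps * rng.standard_normal(width) / np.sqrt(width)          # perturbation of size ~eps
    log_growth = 0.0
    for _ in range(steps):
        x = sigma_u * rng.standard_normal()        # i.i.d. scalar input, shared by both trajectories
        h1 = np.tanh(w @ h1 + u * x)
        h2 = np.tanh(w @ h2 + u * x)
        d = np.linalg.norm(h2 - h1)
        log_growth += np.log(d / eps)
        h2 = h1 + (eps / d) * (h2 - h1)            # renormalize the separation back to eps
    return log_growth / steps

for sw in (0.8, 1.5):
    print(f"sigma_w = {sw}: largest Lyapunov exponent ~ {largest_lyapunov(sw):.3f}")
```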