2018
DOI: 10.48550/arxiv.1806.05393
Preprint

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Abstract: In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary…
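The mechanism behind the title claim is an initialization scheme rather than an architectural change: the paper's Delta-Orthogonal initializer places a random orthogonal matrix at the spatial center of an otherwise zero convolution kernel, so each layer is norm-preserving at initialization. Below is a minimal NumPy sketch of that construction (an illustration, not the authors' released code; the helper name delta_orthogonal_kernel is chosen here for clarity).

```python
import numpy as np

def delta_orthogonal_kernel(k, c_in, c_out, seed=None):
    """Return a (k, k, c_in, c_out) conv kernel that acts as an orthogonal map."""
    assert c_out >= c_in, "the construction assumes the channel count does not shrink"
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    q *= np.sign(np.diag(r))            # fix the sign ambiguity of QR
    ortho = q[:, :c_in]                 # (c_out, c_in), orthonormal columns
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = ortho.T    # orthogonal block at the spatial center, zeros elsewhere
    return kernel

# Example: a 3x3 kernel mapping 64 -> 64 channels; the centered slice is orthogonal.
w = delta_orthogonal_kernel(3, 64, 64, seed=0)
print(w.shape, np.allclose(w[1, 1] @ w[1, 1].T, np.eye(64)))
```

With c_out = c_in the centered block is square and exactly orthogonal; for c_out > c_in the sketch uses orthonormal columns so the map remains norm-preserving.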

Cited by 51 publications (62 citation statements)
References 12 publications
“…One must adapt the architecture and the optimization procedure to train them correctly. Some approaches focus on initialization schemes [24,27,70], others on multi-stage training [53,56], multiple losses at different depths [61], adding components to the architecture [2,75], or regularization [33]. As pointed out in our paper, in that respect our LayerScale approach is most closely related to Rezero [2], Skipinit [16], Fixup [75], and T-Fixup [34].…”
Section: Related Work (mentioning)
confidence: 99%
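To make the comparison in this excerpt concrete, here is a small PyTorch-style sketch (an illustration only; ScaledResidual is a hypothetical module name) of the residual-branch scaling shared by ReZero, SkipInit and LayerScale: the residual branch is multiplied by a learnable factor initialized at or near zero, so a very deep stack starts out close to the identity map and trains stably.

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """y = x + scale * block(x), with `scale` learnable and initialized near zero."""

    def __init__(self, block: nn.Module, dim: int, init: float = 0.0, per_channel: bool = True):
        super().__init__()
        self.block = block
        # ReZero uses a single scalar initialized at 0; LayerScale uses a
        # per-channel vector initialized to a small constant (e.g. 1e-4).
        shape = (dim,) if per_channel else (1,)
        self.scale = nn.Parameter(torch.full(shape, init))

    def forward(self, x):
        return x + self.scale * self.block(x)

# Example: the wrapped layer is exactly the identity map at initialization.
layer = ScaledResidual(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)), dim=64)
out = layer(torch.randn(8, 64))
```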
“…We study the effect of the initialization hyperparameters on signal propagation for a very broad class of recurrent architectures, which includes as special cases many state-of-the-art RNN cells, including the GRU (Cho et al., 2014), the LSTM (Hochreiter and Schmidhuber, 1997), and the peephole LSTM (Gers et al., 2002). The analysis is based on the mean field theory of signal propagation developed in a line of prior work (Schoenholz et al., 2016; Xiao et al., 2018; Chen et al., 2018; Yang et al., 2019), as well as the concept of dynamical isometry (Saxe et al., 2013; Pennington et al., 2017) that is necessary for stable gradient backpropagation and which was shown to be crucial for training simpler RNN architectures (Chen et al., 2018). We perform a number of experiments to corroborate the results of the calculations and use them to motivate initialization schemes that outperform standard initialization approaches on a number of long sequence tasks.…”
Section: Introduction (mentioning)
confidence: 99%
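The "dynamical isometry" condition referenced here concerns the singular values of the end-to-end input-output Jacobian. A small numerical sketch (an illustration, not code from the cited papers) for the simplest case, a deep linear network: orthogonal weights keep every singular value of the depth-L product exactly 1, whereas i.i.d. Gaussian weights of matched scale let the spectrum spread over many orders of magnitude; the cited works extend this picture to nonlinear and recurrent networks initialized at criticality.

```python
import numpy as np

def end_to_end_singular_values(depth=100, width=256, orthogonal=True, seed=0):
    rng = np.random.default_rng(seed)
    jac = np.eye(width)                      # running product W_depth @ ... @ W_1
    for _ in range(depth):
        g = rng.standard_normal((width, width))
        if orthogonal:
            w, r = np.linalg.qr(g)
            w *= np.sign(np.diag(r))         # Haar-distributed orthogonal matrix
        else:
            w = g / np.sqrt(width)           # i.i.d. Gaussian entries, variance 1/width
        jac = w @ jac
    return np.linalg.svd(jac, compute_uv=False)

for orthogonal in (True, False):
    s = end_to_end_singular_values(orthogonal=orthogonal)
    print("orthogonal" if orthogonal else "gaussian  ", f"max={s.max():.3g}  min={s.min():.3g}")
```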
“…Turning to the titular edge of chaos, we are inspired by the aforementioned works [15, 16, 27-30] examining criticality in various deep network architectures. However, while many of these papers used the phrase "mean-field theory", they did not actually rely on any MFT analysis: as mentioned above, Gaussianity arises simply as a consequence of the central limit theorem (CLT).…”
Section: Relation to Other Work (mentioning)
confidence: 99%
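For reference, the "criticality" these works examine is usually phrased through the mean-field variance map and the gradient susceptibility chi. A minimal Monte-Carlo sketch is given below, assuming the standard fully-connected tanh setup of this line of work rather than the specific architectures of [15, 16, 27-30]: chi < 1 marks the ordered phase, chi > 1 the chaotic phase, and chi = 1 the order-to-chaos boundary, the edge of chaos.

```python
import numpy as np

Z = np.random.default_rng(0).standard_normal(200_000)   # Monte-Carlo samples of z ~ N(0, 1)

def q_fixed_point(sigma_w, sigma_b, iters=200, q0=1.0):
    """Iterate q <- sigma_w^2 E[tanh(sqrt(q) z)^2] + sigma_b^2 to its fixed point q*."""
    q = q0
    for _ in range(iters):
        q = sigma_w**2 * np.mean(np.tanh(np.sqrt(q) * Z) ** 2) + sigma_b**2
    return q

def chi(sigma_w, sigma_b):
    """Gradient susceptibility chi = sigma_w^2 E[tanh'(sqrt(q*) z)^2] at the fixed point."""
    q = q_fixed_point(sigma_w, sigma_b)
    return sigma_w**2 * np.mean((1.0 - np.tanh(np.sqrt(q) * Z) ** 2) ** 2)

# Scan sigma_w at fixed sigma_b to locate the order-to-chaos transition (chi crossing 1).
for sw in (1.0, 1.3, 1.6, 2.0):
    print(f"sigma_w = {sw:.1f}  chi = {chi(sw, sigma_b=0.05):.3f}")
```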
“…Before turning to perturbative QFT in the next section, we note that in [29] it was reported that in the case of RNNs, the injection of time-series data x destroys the ordered phase, and consequently there is no order-to-chaos phase transition. This arises due to an extra factor that appears in their analogue of (3.39) containing possible correlations in x.…”
Section: The Largest Lyapunov Exponent (mentioning)
confidence: 99%
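The Lyapunov exponent named in the section title can be estimated numerically with a standard two-trajectory (Benettin-style) procedure; the sketch below (an illustration, not the cited paper's analytic calculation; largest_lyapunov is a hypothetical helper) drives two copies of an untrained vanilla tanh RNN with the same i.i.d. input sequence and measures the average exponential growth rate of their separation. A negative value indicates the ordered phase, a positive value the chaotic phase.

```python
import numpy as np

def largest_lyapunov(sigma_w, sigma_u=1.0, width=512, steps=2000, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)   # recurrent weights
    u = rng.standard_normal(width)                                       # input weights (scalar input)
    h1 = np.zeros(width)
    h2 = h1 + eps * rng.standard_normal(width) / np.sqrt(width)          # perturbation of size ~eps
    log_growth = 0.0
    for _ in range(steps):
        x = sigma_u * rng.standard_normal()        # i.i.d. scalar input, shared by both trajectories
        h1 = np.tanh(w @ h1 + u * x)
        h2 = np.tanh(w @ h2 + u * x)
        d = np.linalg.norm(h2 - h1)
        log_growth += np.log(d / eps)
        h2 = h1 + (eps / d) * (h2 - h1)            # renormalize the separation back to eps
    return log_growth / steps

for sw in (0.8, 1.5):
    print(f"sigma_w = {sw}: largest Lyapunov exponent ~ {largest_lyapunov(sw):.3f}")
```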