2020
DOI: 10.1088/1742-5468/abc62b
Wide neural networks of any depth evolve as linear models under gradient descent

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks (NNs) have made a theory of learning dynamics elusive. In this work, we show that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspon…
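The abstract's central object is the first-order Taylor expansion of the network around its initial parameters, f_lin(x; θ) = f(x; θ_0) + ∇_θ f(x; θ_0) · (θ − θ_0). As a rough illustration only (not the authors' code; the small MLP, its sizes, and all function names here are assumptions), such a linearized model can be built in JAX with a Jacobian-vector product:

```python
# Hedged sketch: linearize a network around its initial parameters theta_0,
# i.e. f_lin(x; theta) = f(x; theta_0) + J(x; theta_0) (theta - theta_0).
import jax
import jax.numpy as jnp

def init_mlp(key, widths):
    """Initialize a toy fully-connected network (hypothetical architecture)."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in)
        params.append((w, jnp.zeros(d_out)))
    return params

def apply_mlp(params, x):
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def linearize(apply_fn, params0):
    """Return f_lin(params, x) = f(params0, x) + J(params0, x) . (params - params0)."""
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        primal, tangent = jax.jvp(lambda p: apply_fn(p, x), (params0,), (delta,))
        return primal + tangent
    return f_lin

key = jax.random.PRNGKey(0)
params0 = init_mlp(key, [3, 1024, 1024, 1])
f_lin = linearize(apply_mlp, params0)
x = jnp.ones((5, 3))
print(jnp.allclose(f_lin(params0, x), apply_mlp(params0, x)))  # True: models agree at theta_0
```

At θ = θ_0 the two models agree exactly; the paper's claim is that, as the width grows, their training trajectories under gradient descent also stay close.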

Cited by 386 publications (672 citation statements). References 8 publications.
“…Q depends on the softmax function g at each time step. One can numerically solve the training dynamics of g and obtain the theoretical value of the training loss (Lee et al., 2019). In Figure 5 (left), we confirmed that the theoretical line coincided well with the experimental results of gradient descent training.…”
Section: Theorem 4, F_cross has the first C largest eigenvalues of O(m) (supporting)
confidence: 78%
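To make the quoted procedure concrete: in the kernel regime with a softmax/cross-entropy loss, the function-space outputs follow df/dt = −η Θ (softmax(f) − Y), which has no closed form and is integrated numerically. The following is my own hedged sketch, not the cited authors' code: a random positive semi-definite matrix stands in for the NTK Θ, and one kernel is assumed to be shared across output classes.

```python
# Hedged sketch: Euler-integrate the linearized (kernel-regime) training
# dynamics df/dt = -eta * Theta @ (softmax(f) - Y) and record the loss curve.
import jax
import jax.numpy as jnp

def theoretical_training_loss(theta, f0, y_onehot, eta=1.0, dt=1e-2, steps=500):
    """Integrate the function-space dynamics and record the cross-entropy loss."""
    f = f0
    losses = []
    for _ in range(steps):
        p = jax.nn.softmax(f, axis=-1)
        losses.append(-jnp.mean(jnp.sum(y_onehot * jnp.log(p + 1e-12), axis=-1)))
        f = f - dt * eta * theta @ (p - y_onehot)  # linearized gradient-flow step
    return f, jnp.stack(losses)

# Toy example: a random PSD Gram matrix standing in for the NTK on the training set.
key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
n, c = 32, 3
a = jax.random.normal(k1, (n, n))
theta = a @ a.T / n                        # hypothetical NTK Gram matrix (n x n)
f0 = 0.1 * jax.random.normal(k2, (n, c))   # network outputs at initialization
y = jax.nn.one_hot(jnp.arange(n) % c, c)
_, loss_curve = theoretical_training_loss(theta, f0, y)
```

The resulting loss_curve plays the role of the "theoretical line" the quote compares against gradient-descent experiments.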
“…We used a learning rate below 2/λ_max, which is necessary for the steepest gradient method to converge (Karakida et al., 2019b). In fact, this 2/λ_max acts as a boundary of the neural tangent kernel regime (Lee et al., 2019; Lewkowycz, Bahri, Dyer, Sohl-Dickstein, & Gur-Ari, 2020). Because λ_max increases depending on the width and depth, we need to carefully choose an appropriately scaled learning rate to train the DNNs.…”
Section: C) (mentioning)
confidence: 99%
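As a rough sketch of the quoted prescription (again my own illustration, not the cited authors' code; the toy two-layer network and its sizes are assumptions), one can form the empirical NTK Gram matrix on the training inputs, take its largest eigenvalue λ_max, and keep the learning rate below 2/λ_max:

```python
# Hedged sketch: estimate lambda_max of the empirical NTK Gram matrix and
# choose a learning rate inside the eta < 2 / lambda_max boundary.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def apply_fn(params, x):
    """Toy two-layer scalar-output network (hypothetical, for illustration)."""
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def ntk_gram(params, x):
    """Empirical NTK: Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    def flat_grad(xi):
        g = jax.grad(lambda p: apply_fn(p, xi[None, :])[0, 0])(params)
        return ravel_pytree(g)[0]
    grads = jax.vmap(flat_grad)(x)            # shape (n, num_params)
    return grads @ grads.T

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
width = 512
params = (jax.random.normal(k1, (3, width)) / jnp.sqrt(3),
          jax.random.normal(k2, (width, 1)) / jnp.sqrt(width))
x = jax.random.normal(k3, (16, 3))
theta = ntk_gram(params, x)
lam_max = jnp.linalg.eigvalsh(theta)[-1]      # largest NTK eigenvalue
eta = 1.0 / lam_max                           # safely below 2 / lambda_max
```

Because λ_max grows with width and depth, recomputing it for each architecture (or rescaling the parameterization) is what the quote means by an "appropriately scaled learning rate."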