2021
DOI: 10.48550/arxiv.2110.15596
Preprint

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit

Abstract: To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics t…

Cited by 1 publication (1 citation statement)
References 12 publications
“…A large learning rate in the initial steps can impact the conditioning of the loss surface [JSF+20, CKL+21] and potentially improve generalization performance [LWM19, LBD+20]. Under structural assumptions on the data, it has been proved that one gradient step with a sufficiently large learning rate can drastically decrease the training loss [CLB21], extract task-relevant features [DM20, FCB22], or escape the trivial stationary point at initialization [HCG21]. While these works also highlight the benefit of one feature learning step, to our knowledge this advantage has not been precisely characterized in the proportional regime (where the performance of RF models has been extensively studied).…”
Section: Related Work
confidence: 99%
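
The citation statement above contrasts a random-features (RF) model, where the first-layer weights stay at initialization, with a single large gradient step on those weights before refitting the readout. The following minimal sketch illustrates that contrast; it is not code from the cited papers, and the network sizes, the learning rate `eta`, the ReLU features, and the single-index target are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the cited papers' setup): a two-layer ReLU
# network where the first-layer weights W receive one large gradient step,
# after which the second layer is refit by ridge regression, as in an RF model.
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 512, 32, 256                    # samples, input dim, hidden width (assumed)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.tanh(2.0 * X[:, 0])                    # assumed single-index target

W = rng.standard_normal((d, width)) / np.sqrt(d)   # first-layer weights at init
a = rng.standard_normal(width) / np.sqrt(width)    # second-layer weights at init

def features(X, W):
    """ReLU features of the first layer."""
    return np.maximum(X @ W, 0.0)

def fit_readout(X, W, y, reg=1e-3):
    """Ridge-regress the second layer on frozen features (the RF step)."""
    Phi = features(X, W)
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(width), Phi.T @ y)

# Baseline: random features only (W stays at initialization).
a_rf = fit_readout(X, W, y)
loss_rf = np.mean((features(X, W) @ a_rf - y) ** 2)

# One large gradient step on W (the "feature learning step" discussed above).
eta = 5.0                                      # deliberately large learning rate (assumed)
Phi = features(X, W)
resid = Phi @ a - y                            # residual of the network at init
grad_W = X.T @ ((resid[:, None] * a[None, :]) * (Phi > 0)) / n
W1 = W - eta * grad_W

a_fl = fit_readout(X, W1, y)
loss_fl = np.mean((features(X, W1) @ a_fl - y) ** 2)
print(f"train MSE  random features: {loss_rf:.4f}   after one step: {loss_fl:.4f}")
```

In this toy setting the single large step moves the hidden weights toward the task-relevant direction before the readout is refit, which is the effect the quoted works study under their respective structural assumptions on the data.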