2020
DOI: 10.48550/arxiv.2001.04413
Preprint
Backward Feature Correction: How Deep Learning Performs Deep Learning

Abstract: How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner represents a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and auto…
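The decomposition the abstract describes can be illustrated with a toy sketch (the target, weights, and architecture below are hypothetical choices for illustration, not the paper's construction): a "complicated" target is a composition of two simple maps, and learning the feature map first and the readout second mirrors the layer-by-layer picture.

```python
import numpy as np

# Hypothetical two-stage target (not from the paper):
#   g1: x -> elementwise ReLU of a linear map  (simple feature map)
#   g2: h -> linear readout                    (simple readout)
# The composed f = g2 ∘ g1 is "complicated" relative to either stage.

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # illustrative first-layer weights
w2 = rng.normal(size=4)        # illustrative readout weights

def g1(x):
    return np.maximum(W1 @ x, 0.0)

def g2(h):
    return w2 @ h

def f(x):
    return g2(g1(x))   # the composed target, learnable stage by stage
```

Learning `g1` and `g2` separately is the sample- and time-complexity saving the abstract refers to: each stage is a far simpler hypothesis class than the composition learned end to end from scratch.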

Cited by 35 publications (124 citation statements)
References 27 publications
“…Additionally, the authors also demonstrate experimentally that depth 3 networks outperform depth 2 networks for learning ball indicators, even if the shallow network has many more tweakable weights. While we build on a similar proof strategy as in Safran and Shamir [25], our work differs from theirs in that they show that for a ball indicator function of any radius, there exists a distribution under which it is hard to approximate using depth 2 networks, whereas our work shows that for a specific distribution there exists a ball indicator function with radius in the interval [1,2] which is hard to approximate using a depth 2 network. Moreover, our main result rigorously proves the learnability of the target function used in their experiment.…”
Section: Related Work
confidence: 98%
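The ball-indicator targets in the quoted statement are easy to write down directly. The sketch below constructs the target function and a toy dataset; the dimension, radius, and sampling distribution are illustrative choices, not the cited paper's experimental setup:

```python
import numpy as np

def ball_indicator(x, radius=1.5):
    # 1 inside the Euclidean ball of the given radius, 0 outside.
    # A radius of 1.5 is an arbitrary value in the interval [1, 2]
    # discussed in the quoted statement.
    return (np.linalg.norm(x, axis=-1) <= radius).astype(float)

# Toy dataset in d dimensions (illustrative sampling distribution).
rng = np.random.default_rng(0)
d = 10
X = rng.normal(size=(1000, d))
y = ball_indicator(X)
```

The depth separation at issue is that a depth-3 network can realize this target via the two-stage composition x → ‖x‖² → threshold, whereas the quoted results show depth-2 networks need exponentially many units to approximate it under the constructed distribution.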
“…• We show that there exist a sequence $\{D_d\}_{d=2}^{\infty}$ of $d$-dimensional distributions and a sequence of constants $\{\lambda_d\}_{d=2}^{\infty}$, where $\lambda_d \in [1,2]$ for all $d$, such that no neural network of depth 2 and width less than $\Omega(\exp(\Omega(d)))$ can approximate ball indicator functions with radii $\{\lambda_d\}_{d=2}^{\infty}$ to accuracy better than $\Omega(d^{-4})$ (Thm. 2.2).…”
Section: Our Contributions
confidence: 99%