2019
DOI: 10.48550/arxiv.1909.12292
Preprint

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Abstract: Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size n, the (inverse) target error 1/ε, and the (inverse) failure probability 1/δ. This work shows that O(1/ε) iterations of gradient descent with Ω(1/ε²) training examples on two-layer ReLU networks of any width exceeding polylog(n, 1/ε, 1/δ) suffice…
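To make the abstract's setting concrete, here is a hypothetical minimal sketch (not the authors' code; the toy data, the width m = 256, the step size, the fixed second-layer signs, and the choice of logistic loss are illustrative assumptions): a two-layer ReLU network with 1/√m output scaling, trained by full-batch gradient descent for binary labels in {-1, +1}.

import numpy as np

def forward(X, W, a, m):
    # hidden ReLU features, then a 1/sqrt(m)-scaled linear output
    H = np.maximum(X @ W.T, 0.0)             # (n, m)
    return (H @ a) / np.sqrt(m)              # (n,)

def grad_W(X, y, W, a, m):
    # gradient of the average logistic loss log(1 + exp(-y f(x))) w.r.t. W,
    # with the second-layer signs a held fixed (a common simplification)
    f = forward(X, W, a, m)
    p = 0.5 * (1.0 - np.tanh(0.5 * y * f))   # numerically stable 1/(1 + exp(y f))
    mask = (X @ W.T > 0).astype(float)       # ReLU derivative, (n, m)
    coef = (-y * p)[:, None] * mask * a[None, :] / np.sqrt(m)   # (n, m)
    return coef.T @ X / len(y)               # (m, d)

rng = np.random.default_rng(0)
n, d, m = 200, 5, 256                        # m plays the role of the (modest) width
X = rng.standard_normal((n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)         # a separable toy task, labels in {-1, +1}
W = rng.standard_normal((m, d))              # random first layer
a = rng.choice([-1.0, 1.0], size=m)          # fixed random output signs

lr, steps = 1.0, 500
for _ in range(steps):
    W -= lr * grad_W(X, y, W, a, m)

print("training error:", np.mean(np.sign(forward(X, W, a, m)) != y))

In the abstract's accounting, the number of gradient steps scales as O(1/ε), the sample size as Ω(1/ε²), and the width only needs to exceed polylog(n, 1/ε, 1/δ).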

Cited by 31 publications (65 citation statements) | References 16 publications
“…The exponential convergence rate is a direct outcome of Lemma 3.5 and Theorem 3.1. Note that our exponential convergence rate is much faster than existing ones under a similar separable setting [22,24,25], which are all polynomial in n, e.g.,…”
Section: Transition From Separable To Non-separable (mentioning)
confidence: 93%
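To unpack the contrast drawn in the statement above, a schematic comparison (illustrative only, not the exact bounds of [22,24,25] or of the citing work): an exponential rate drives the risk down as
\[ \mathcal{R}(\theta_t) \;\le\; \mathcal{R}(\theta_0)\, e^{-c\,t}, \]
whereas a rate that is polynomial in n, e.g.
\[ \mathcal{R}(\theta_t) \;=\; O\!\Big(\tfrac{\operatorname{poly}(n)}{t}\Big), \]
needs a number of iterations that grows with n to reach the same target error.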
“…There are also other works studying the generalization performance of neural networks as nonparametric regressors, outside the NTK regime; see [47,48]. For classification, most of the existing results are established for separable data; see [22,24,49] and references therein. In particular, Hu et al. [23] consider classification with noisy labels (labels are randomly flipped) and propose to use the square loss with ℓ2 regularization.…”
Section: A Gradient Descent and Neural Tangent Kernel (mentioning)
confidence: 99%
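As a generic illustration of the objective mentioned in the statement above (not the method of Hu et al. [23]; the function and variable names are hypothetical): the square loss against ±1 labels with an ℓ2 penalty on the parameters.

import numpy as np

def regularized_square_loss(preds, labels, params, lam=1e-3):
    # mean squared error against labels in {-1, +1}, plus an l2 penalty on the parameters
    return np.mean((preds - labels) ** 2) + lam * np.sum(params ** 2)

loss = regularized_square_loss(np.array([0.8, -0.3]), np.array([1.0, -1.0]), np.ones(4))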
“…[3,9]) require a very large width, say, at least Ω(n²). Recently, [20,10] showed that the width can be smaller than O(n) if the dataset is separable with a large enough margin. More specifically, the number of parameters required is inversely proportional to the margin by which the binary-labelled dataset can be separated, so for certain datasets the width can be as small as poly(log(n)).…”
Section: Relationship To Prior Work (mentioning)
confidence: 99%
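Read schematically (an illustrative paraphrase of the statement above, not the theorem statements of [20,10]): if the binary-labelled data is separable with margin γ in the sense assumed by these works, the required width behaves like
\[ m \;\gtrsim\; \frac{1}{\operatorname{poly}(\gamma)}\,\operatorname{polylog}\!\Big(n,\ \tfrac{1}{\epsilon},\ \tfrac{1}{\delta}\Big), \]
so a data distribution with a constant margin needs only polylogarithmic width in n.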
“…Most works use an NTK-type analysis to prove a generalization bound (e.g., [3,9]), but these works require a very large width, say, at least Ω(n²). References [20,10] showed that the width requirement can be small if the data satisfies certain assumptions (i.e., the margin is large enough), but the width can still be large for other data distributions. One notable exception is [1], which used a different initialization (scaling by 1/m instead of 1/√m) and thus allows the trajectory to go beyond the kernel regime.…”
Section: Introduction (mentioning)
confidence: 99%
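To make the scaling difference in the statement above concrete, a minimal sketch (the names, shapes, and initialization are illustrative assumptions, not code from [1]): the same two-layer ReLU network evaluated with the 1/√m output scaling and with the 1/m scaling.

import numpy as np

def two_layer_relu(x, W, a, scale):
    # hidden ReLU layer followed by a scaled linear output
    h = np.maximum(W @ x, 0.0)
    return scale * (a @ h)

rng = np.random.default_rng(0)
m, d = 1024, 10
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
x = rng.standard_normal(d)

f_ntk = two_layer_relu(x, W, a, scale=1.0 / np.sqrt(m))  # 1/sqrt(m): kernel (NTK) regime scaling
f_mf  = two_layer_relu(x, W, a, scale=1.0 / m)           # 1/m: the alternative scaling attributed to [1] above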