2017
DOI: 10.48550/arxiv.1712.06541
Preprint

Size-Independent Sample Complexity of Neural Networks

Cited by 52 publications (102 citation statements) | References 0 publications
“…It is well known that the VC-dimension of neural networks is at least linear in the number of parameters (Bartlett et al., 2017b), and therefore classical VC theory cannot explain the generalization ability of modern neural networks with more parameters than training samples. Researchers have proposed norm-based generalization bounds (Bartlett & Mendelson, 2002; Bartlett et al., 2017a; Neyshabur et al., 2015, 2017, 2019; Konstantinos et al., 2017; Golowich et al., 2017; Li et al., 2018a) and compression-based bounds (Arora et al., 2018). Dziugaite & Roy (2017) and Zhou et al. (2019) used the PAC-Bayes approach to compute non-vacuous generalization bounds for MNIST and ImageNet, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
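To make "norm-based" concrete, here is a hedged sketch of the generic shape such bounds take; it is not a quotation from any cited work, the exact constants and depth dependence vary across the papers listed above, and the layer-wise norm bounds M_1, …, M_d and input-norm bound B are illustrative symbols:

\[
  L(f) \;\le\; \widehat{L}_n(f) \;+\; 2\,\mathfrak{R}_n(\mathcal{F}) \;+\; 3\sqrt{\frac{\log(2/\delta)}{2n}},
  \qquad
  \mathfrak{R}_n(\mathcal{F}) \;\lesssim\; \frac{B\,\sqrt{d}\,\prod_{j=1}^{d} M_j}{\sqrt{n}},
\]

where L(f) is the expected (bounded, Lipschitz) loss, \widehat{L}_n(f) its empirical average over n samples, and \mathfrak{R}_n(\mathcal{F}) the Rademacher complexity of the depth-d network class \mathcal{F}. The point of bounds of this form is that the right-hand side depends on norms of the weight matrices rather than on the number of weights.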
“…Particularly, one open question looms large in the recent literature: how to theoretically guarantee the generalization performance of NNs trained with finitely many samples. To address this question, significant advances have been reported by Bartlett et al. (2017), Golowich et al. (2017), Neyshabur et al. (2015, 2017, 2018), and Arora et al. (2018). In most of the existing results, the generalization error depends polynomially on the dimensionality (number of weights).…”
Section: Regularized Neural Network (mentioning)
confidence: 99%
“…For example, Graves et al. (2013) report that after training with merely 462 speech samples, deep LSTM RNNs achieve a test-set error of 17.7% on the TIMIT phoneme recognition benchmark, the best recorded score. Despite the popularity of RNNs in applications, their theory is less studied than that of feedforward neural networks (Haussler, 1992; Bartlett et al., 2017; Neyshabur et al., 2017; Golowich et al., 2017; Li et al., 2018). Several long-standing fundamental questions remain regarding the approximation, trainability, and generalization of RNNs.…”
Section: Introduction (mentioning)
confidence: 99%