Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Ji, Ziwei; Telgarsky, Matus

doi:10.48550/arxiv.1909.12292

Cited by 31 publications

(65 citation statements)

References 16 publications

“…The exponential convergence rate is a direct outcome of Lemma 3.5 and Theorem 3.1. Note that our exponential convergence rate is much faster than existing ones under the similar separable setting [22,24,25], which are all polynomial with n, e.g.,…”

Section: Transition From Separable To Non-separable

mentioning

confidence: 93%

“…There are also other works studying the generalization performance of the neural network as a nonparametric regressor, out of the NTK regime; see [47,48]. For classification, most of the existing results are established based on the separable data; see [22,24,49] and references therein. In particular, Hu et al. [23] consider classification with noisy labels (labels are randomly flipped) and propose to use the square loss with ℓ₂ regularization.…”

Section: A Gradient Descent and Neural Tangent Kernel

mentioning

confidence: 99%

“…Most such results are in the regression setting with a handful of exceptions. Ji and Telgarsky [22] showed that only polylogarithmic width is sufficient for gradient descent to overfit the training data using logistic loss. Hu et al [23] proved generalization error bound for regularized NTK in classification.…”

mentioning

confidence: 99%

Understanding Square Loss in Training Overparametrized Neural Network Classifiers

Hu,

Wang,

Wang

et al. 2021

Preprint

Deep learning has achieved many breakthroughs in modern classification tasks. Numerous architectures have been proposed for different data structures but when it comes to the loss function, the cross-entropy loss is the predominant choice. Recently, several alternative losses have seen revived interests for deep classifiers. In particular, empirical evidence seems to promote square loss but a theoretical justification is still lacking. In this work, we contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks in the neural tangent kernel (NTK) regime. Interesting properties regarding the generalization error, robustness, and calibration error are revealed. We consider two cases, according to whether classes are separable or not. In the general non-separable case, fast convergence rate is established for both misclassification rate and calibration error. When classes are separable, the misclassification rate improves to be exponentially fast. Further, the resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness. We expect our findings to hold beyond the NTK regime and translate to practical settings. To this end, we conduct extensive empirical studies on practical neural networks, demonstrating the effectiveness of square loss in both synthetic low-dimensional data and real image data. Comparing to cross-entropy, square loss has comparable generalization error but noticeable advantages in robustness and model calibration.

Introduction: The pursuit of better classifiers has fueled the progress of machine learning and deep learning research. The abundance of benchmark image datasets, e.g., MNIST, CIFAR, ImageNet, etc., provides test fields for all kinds of new classification models, especially those based on deep neural networks (DNN). With the introduction of CNN, ResNets, and transformers, DNN classifiers are constantly improving and catching up to the human-level performance. In contrast to the active innovations in model architecture, the training objective remains largely stagnant, with cross-entropy loss being the default choice. Despite its popularity, cross-entropy has been shown to be problematic in some applications. Among others, Yu et al. [1] argued that features learned from cross-entropy lack interpretability and proposed a new loss aiming for maximum coding rate reduction. Pang et al. [2] linked the use of cross-entropy to adversarial vulnerability and proposed a new classification loss

“…[3,9]) require a very large width, say, at least Ω(n²). Recently, [20], [10] showed that the width can be smaller than O(n) if the dataset is separable with a large enough margin. More specifically, the number of parameters required is inversely proportional to the margin by which the binary labelled dataset can be separated, thus for certain dataset the width can be as small as poly(log(n)).…”

Section: Relationship To Prior Work

mentioning

confidence: 99%

“…Most works explore the NTK-type analysis to prove a generalization bound (e.g., [3,9]), but these works require a very large width, say, at least Ω(n²). References [20,10] showed that the requirement for the width can be small if the data satisfies certain assumptions (i.e., the margin is large enough), but the width can still be large for other data distributions. One notable exception is [1], which used a different initialization (scaling by 1/m instead of 1/√m) and thus allow the trajectory to go beyond the kernel regime.…”

Section: Introduction

mentioning

confidence: 99%

Achieving Small Test Error in Mildly Overparameterized Neural Networks

Liang,

Sun,

Srikant

2021

Preprint

Recent theoretical works on over-parameterized neural nets have focused on two aspects: optimization and generalization. Many existing works that study optimization and generalization together are based on neural tangent kernel and require a very large width. In this work, we are interested in the following question: for a binary classification problem with two-layer mildly over-parameterized ReLU network, can we find a point with small test error in polynomial time? We first show that the landscape of loss functions with explicit regularization has the following property: all local minima and certain other points which are only stationary in certain directions achieve small test error. We then prove that for convolutional neural nets, there is an algorithm which finds one of these points in polynomial time (in the input dimension and the number of data points). In addition, we prove that for a fully connected neural net, with an additional assumption on the data distribution, there is a polynomial time algorithm.

Six lectures on linearized neural networks

Misiakiewicz,

Montanari

2024

J. Stat. Mech.

This tutorial examines what can be learnt about the behavior of multi-layer neural networks from the analysis of linear models. While there are important gaps between neural networks and their linear counterparts, many useful lessons can be learnt by studying the latter. A few preliminary remarks, before diving into the math:

• We will not assume specific background in machine learning, let alone neural networks. On the other hand, we will assume some graduate-level mathematics, in particular probability theory (however, we will refer to the literature for complete proofs).

• Some of the notations that are used throughout the text will be summarized in appendix A.

• We will keep bibliographic references in the main text to a minimum. A short guide to the literature is given in appendix B.
