The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime

Montanari, Andrea; Ruan, Feng; Sohn, Youngtak; Yan, Jun

doi:10.48550/arxiv.1911.01544

Cited by 33 publications

(26 citation statements)

References 38 publications

“…These are the regularity assumptions for the feature matrix in [33]. The same intuition also appears in the analysis of the unperturbed random kernel models, in particular, the random feature model [18]. In this paper, we suppose that the feature matrix and the activation function satisfy the regularity assumptions in [33] and conjecture that the Gaussian equivalence is valid for (ν 1 , ν 2 , .…”

Section: Gaussian Equivalence Conjecture With An Intuitive Explanation

mentioning

confidence: 65%

“…ϕ(•) is the identity function and ∆ = 0 in (3)) is precisely analyzed in [14] where the feature matrix is Gaussian. In a subsequent work, [18] uses the CGMT to accurately analyze the maximum-margin linear classifier in the overparametrized regime. The work in [15] precisely characterizes the performance of the standard formulation, i.e.…”

Section: Related Work

mentioning

confidence: 99%

“…In the standard setting, i.e. ∆ = 0, the cGEC is equivalent to the uniform Gaussian equivalence theorem (uGET), observed and used in many earlier papers [15], [16], [18], [33]. Recently, the work in [17] provided a rigorous proof of the uGET.…”

Section: Gaussian Equivalence Conjecture With An Intuitive Explanation

mentioning

confidence: 99%

“…where C ⋆ (∆, λ) is the optimal cost of the deterministic problem in (18). Here, the function h(•) is defined as follows…”

Section: B Noise Regularization Effects

mentioning

confidence: 99%

On the Inherent Regularization Effects of Noise Injection During Training

Dhifallah,

2021

Preprint

Randomly perturbing networks during the training process is a commonly used approach to improving generalization performance. In this paper, we present a theoretical study of one particular way of random perturbation, which corresponds to injecting artificial noise to the training data. We provide a precise asymptotic characterization of the training and generalization errors of such randomly perturbed learning problems on a random feature model. Our analysis shows that Gaussian noise injection in the training process is equivalent to introducing a weighted ridge regularization, when the number of noise injections tends to infinity. The explicit form of the regularization is also given. Numerical results corroborate our asymptotic predictions, showing that they are accurate even in moderate problem dimensions. Our theoretical predictions are based on a new correlated Gaussian equivalence conjecture that generalizes recent results in the study of random feature models.

“…When the number of parameters in the models is excessively large, there are multiple techniques to precisely measure generalization errors. To name a few, the spectrum-based analysis [45,46,47,48,49,50,51,52], and the utilization of loss functions whose shapes are almost convex or approaches zero due to the excess parameters [53,54,55]. A disadvantage of this approach is that until now it can only deal with linear or two-layer neural network models.…”

Section: Definition 2 (Population Minimum

mentioning

confidence: 99%

On generalization bounds for deep networks based on loss surface implicit regularization

Imaizumi, Schmidt-Hieber

2022

Preprint

The classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded to be important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.

Double data piling: a high-dimensional solution for asymptotically perfect multi-category classification

Kim,

Chang,

Ahn

et al. 2024

J. Korean Stat. Soc.

For high-dimensional classification, interpolation of training data manifests as the data piling phenomenon, in which linear projections of data vectors from each class collapse to a single value. Recent research has revealed an additional phenomenon known as the ‘second data piling’ for independent test data in binary classification, providing a theoretical understanding of asymptotically perfect classification. This paper extends these findings to multi-category classification and provides a comprehensive characterization of the double data piling phenomenon. We define the maximal data piling subspace, which maximizes the sum of pairwise distances between piles of training data in multi-category classification. Furthermore, we show that a second data piling subspace that induces data piling for independent data exists and can be consistently estimated by projecting the negatively-ridged discriminant subspace onto an estimated ‘signal’ subspace. By leveraging this second data piling phenomenon, we propose a bias-correction strategy for class assignments, which asymptotically achieves perfect classification. The present research sheds light on benign overfitting and enhances the understanding of perfect multi-category classification of high-dimensional discrimination with a help of high-dimensional asymptotics.
