On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

Wu, Denny; Ji, Xu

doi:10.48550/arxiv.2006.05800

Cited by 7 publications

(7 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Termed "benign overfitting" by [2], this phenomenon has been studied analytically in the framework of linear regression with isotropic noise. It was extended to the anisotropic case later in [16], where the authors proposed the fruitful idea of "alignment" and "misalignment" between the signal and the covariance matrix of the noise.…”

Section: Related Workmentioning

confidence: 99%

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A new take on Robust Interpolation

Minsker¹,

Ndaoud²,

Shen³

2021

Preprint

View full text Add to dashboard Cite

We study the supervised clustering problem under the twocomponent anisotropic Gaussian mixture model in high dimensions and in the non-asymptotic setting. We first derive a lower and a matching upper bound for the minimax risk of clustering in this framework. We also show that in the high-dimensional regime, the linear discriminant analysis (LDA) classifier turns out to be sub-optimal in the minimax sense. Next, we characterize precisely the risk of 2 -regularized supervised least squares classifiers. We deduce the fact that the interpolating solution may outperform the regularized classifier, under mild assumptions on the covariance structure of the noise. Our analysis also shows that interpolation can be robust to corruption in the covariance of the noise when the signal is aligned with the "clean" part of the covariance, for the properly defined notion of alignment. To the best of our knowledge, this peculiar phenomenon has not yet been investigated in the rapidly growing literature related to interpolation. We conclude that interpolation is not only benign but can also be optimal, and in some cases robust.

show abstract

Section: Related Workmentioning

confidence: 99%

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A new take on Robust Interpolation

Minsker¹,

Ndaoud²,

Shen³

2021

Preprint

View full text Add to dashboard Cite

show abstract

“…7 While we were finishing this work, we became aware of parallel work by Wu and Xu [2020] who derive asymptotic risk predictions for ridge regression with general quadratic penalties that subsume the diagonal, group-wise constant penalty matrices that we consider. When specialized to our setting, their asymptotic risk formulae are less natural than ours as they are not phrased in terms of λ 1 , .…”

Section: Using a Single Regularization Parametermentioning

confidence: 99%

$σ$-Ridge: group regularized ridge regression via empirical Bayes noise level cross-validation

Ignatiadis,

Lolas

2020

Preprint

View full text Add to dashboard Cite

Features in predictive models are not exchangeable, yet common supervised models treat them as such. Here we study ridge regression when the analyst can partition the features into K groups based on external side-information. For example, in high-throughput biology, features may represent gene expression, protein abundance or clinical data and so each feature group represents a distinct modality. The analyst's goal is to choose optimal regularization parameters λ = (λ1, . . . , λK ) -one for each group. In this work, we study the impact of λ on the predictive risk of group-regularized ridge regression by deriving limiting risk formulae under a high-dimensional random effects model with p n as n → ∞. Furthermore, we propose a data-driven method for choosing λ that attains the optimal asymptotic risk: The key idea is to interpret the residual noise variance σ 2 , as a regularization parameter to be chosen through cross-validation. An empirical Bayes construction maps the one-dimensional parameter σ to the K-dimensional vector of regularization parameters, i.e., σ → λ(σ). Beyond its theoretical optimality, the proposed method is practical and runs as fast as cross-validated ridge regression without feature groups (K = 1). 2 2 /2 [Hoerl and Kennard, 1970,

show abstract

“…Existing research has been focused on studying the learning risk behavior of the interpolator (the estimator that interpolates the training data) under two regimes. The first regime investigates the asymptotic learning risk of the interpolator (Mei and Montanari, 2019;Hastie et al, 2019;Liao et al, 2020;Wu and Xu, 2020;Richards et al, 2021). These works derive the asymptotic learning risk by assuming that the data dimension d and number of training samples n (and number of parameters s in the non-linear regression case) grow simultaneously while their ratio is kept fixed.…”

Section: Introductionmentioning

confidence: 99%

Towards an Understanding of Benign Overfitting in Neural Networks

Li,

Zhou,

Gretton

2021

Preprint

View full text Add to dashboard Cite

Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss; yet surprisingly, they possess near-optimal prediction performance, contradicting classical learning theory. We examine how these benign overfitting phenomena occur in a two-layer neural network setting where sample covariates are corrupted with noise. We address the high dimensional regime, where the data dimension d grows with the number n of data points. Our analysis combines an upper bound on the bias with matching upper and lower bounds on the variance of the interpolator (an estimator that interpolates the data). These results indicate that the excess learning risk of the interpolator decays under mild conditions. We further show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate, which to our knowledge is the first generalization result for such networks. Finally, our theory predicts that the excess learning risk starts to increase once the number of parameters s grows beyond O(n 2 ), matching recent empirical findings.

show abstract

On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

Cited by 7 publications

References 36 publications

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A new take on Robust Interpolation

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A new take on Robust Interpolation

$σ$-Ridge: group regularized ridge regression via empirical Bayes noise level cross-validation

Towards an Understanding of Benign Overfitting in Neural Networks

Contact Info

Product

Resources

About