Significance. Multilayer neural networks have proven extremely successful in a variety of tasks, from image classification to robotics. However, the reasons for this practical success and its precise domain of applicability are unknown. Learning a neural network from data requires solving a complex optimization problem with millions of variables. This is done by stochastic gradient descent (SGD) algorithms. We study the case of two-layer networks and derive a compact description of the SGD dynamics in terms of a limiting partial differential equation. Among other consequences, this shows that the SGD dynamics does not become more complex when the network size increases.
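As an illustration of the setting described above, here is a minimal sketch (not the paper's code) of one-pass SGD on a mean-field two-layer network with output (1/N) Σ_i a_i σ(⟨w_i, x⟩); the limiting PDE describes the evolution of the empirical distribution of the neuron parameters (a_i, w_i) as N grows. The teacher function, ReLU activation, step size, and Gaussian data below are arbitrary choices made for illustration, not values taken from the paper.

```python
# Minimal sketch: one-pass SGD on a two-layer network in the mean-field
# normalization y_hat(x) = (1/N) * sum_i a_i * relu(<w_i, x>).
# The abstract's PDE describes the N -> infinity limit of the empirical
# distribution of the neuron parameters (a_i, w_i) under such dynamics.
import numpy as np

rng = np.random.default_rng(0)
d, N, steps, lr = 20, 200, 5000, 0.5

# hypothetical teacher: y = relu(<w_star, x>) plus small noise
w_star = rng.normal(size=d) / np.sqrt(d)

a = rng.normal(size=N)                     # second-layer weights
W = rng.normal(size=(N, d)) / np.sqrt(d)   # first-layer weights

def forward(x):
    pre = W @ x                            # preactivations, shape (N,)
    return (a @ np.maximum(pre, 0)) / N, pre

for t in range(steps):
    x = rng.normal(size=d)
    y = max(w_star @ x, 0.0) + 0.01 * rng.normal()
    y_hat, pre = forward(x)
    err = y_hat - y
    # gradients of the squared loss 0.5 * (y_hat - y)^2 w.r.t. (a_i, w_i)
    grad_a = err * np.maximum(pre, 0) / N
    grad_W = err * (a * (pre > 0))[:, None] * x[None, :] / N
    # step size scaled by N so each neuron moves at order one,
    # compensating the 1/N factor in the mean-field output
    a -= lr * N * grad_a
    W -= lr * N * grad_W
```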
Deep learning methods operate in regimes that defy the traditional statistical mindset. Neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called 'double descent' curve. As the model complexity increases, the test error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the test error is found above the interpolation threshold, often in the extreme overparametrization regime in which the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the d-dimensional sphere S^{d−1}, from n i.i.d. samples (x_i, y_i) ∈ S^{d−1} × ℝ, i ≤ n. We perform ridge regression on N random features of the form σ(⟨w_a, x⟩), a ≤ N. This can be equivalently described as a two-layer neural network with random first-layer weights. We compute the precise asymptotics of the test error, in the limit N, n, d → ∞ with N/d and n/d fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures. In particular, above a critical value of the signal-to-noise ratio, minimum test error is achieved by extremely overparametrized interpolators, i.e., networks that have a number of parameters much larger than the sample size and vanishing training error.
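The sketch below sets up the random features ridge regression model described above: inputs on the sphere of radius √d, N random features σ(⟨w_a, x⟩), and a sweep over the overparametrization ratio N/d. The ReLU activation, linear target, noise level, and ridge penalty lam are assumptions made for illustration and do not come from the paper; whether the peak near the interpolation threshold N ≈ n is visible depends on the signal-to-noise ratio and on taking the ridge penalty small.

```python
# Minimal sketch: ridge regression on N random ReLU features of inputs
# drawn from the sphere, tracking test error as N/d grows.
import numpy as np

rng = np.random.default_rng(1)

def sphere(n, d):
    # n points uniform on the sphere of radius sqrt(d) in R^d
    X = rng.normal(size=(n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True) * np.sqrt(d)

d, n, lam, noise = 100, 300, 1e-3, 0.1
beta = rng.normal(size=d) / np.sqrt(d)           # hypothetical linear target
X_tr, X_te = sphere(n, d), sphere(2000, d)
y_tr = X_tr @ beta + noise * rng.normal(size=n)
y_te = X_te @ beta

for ratio in [0.5, 1.0, 2.0, 3.0, 10.0]:         # overparametrization N/d
    N = int(ratio * d)
    W = rng.normal(size=(N, d)) / np.sqrt(d)     # random first-layer weights
    Z_tr = np.maximum(X_tr @ W.T, 0)             # features sigma(<w_a, x>)
    Z_te = np.maximum(X_te @ W.T, 0)
    # ridge regression on the random features (second-layer weights)
    a = np.linalg.solve(Z_tr.T @ Z_tr + lam * n * np.eye(N), Z_tr.T @ y_tr)
    test_err = np.mean((Z_te @ a - y_te) ** 2)
    print(f"N/d = {ratio:4.1f}  test MSE = {test_err:.4f}")
```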
Most high-dimensional estimation and prediction methods propose to minimize a cost function (empirical risk) that is written as a sum of losses associated with each data point (each example). In this paper we focus on the case of non-convex losses, which is practically important but still poorly understood. Classical empirical process theory implies uniform convergence of the empirical (or sample) risk to the population risk. While, under additional assumptions, uniform convergence implies consistency of the resulting M-estimator, it does not ensure that the latter can be computed efficiently. In order to capture the complexity of computing M-estimators, we propose to study the landscape of the empirical risk, namely its stationary points and their properties. We establish uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts, as soon as the number of samples becomes larger than the number of unknown parameters (modulo logarithmic factors). Consequently, good properties of the population risk can be carried over to the empirical risk, and we are able to establish a one-to-one correspondence between their stationary points. We demonstrate that in several problems, such as non-convex binary classification, robust regression, and Gaussian mixture models, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms. We extend our analysis to the very high-dimensional setting in which the number of parameters exceeds the number of samples, and provide a characterization of the empirical risk landscape under a nearly information-theoretically minimal condition. Namely, if the number of samples exceeds the sparsity of the unknown parameter vector (modulo logarithmic factors), then a suitable uniform convergence result takes place. We apply this result to non-convex binary classification and robust regression in very high dimensions.
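As a rough numerical illustration of the gradient uniform convergence discussed above, the sketch below compares the gradient of the empirical risk for a non-convex binary classification loss (squared loss with a logistic link) against a large-sample Monte Carlo proxy for the population gradient, for growing sample size n. The model, parameters, and the fact that the maximum is taken over a small set of random points rather than over the whole parameter space are simplifying assumptions for illustration, not the paper's construction.

```python
# Minimal sketch: empirical vs. (approximate) population gradient of the
# non-convex risk R_n(theta) = mean_i (y_i - s(<x_i, theta>))^2, s = logistic.
import numpy as np

rng = np.random.default_rng(2)
d = 20
theta_star = rng.normal(size=d) / np.sqrt(d)     # hypothetical true parameter

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_risk(theta, X, y):
    p = sigmoid(X @ theta)
    # gradient of mean_i (y_i - p_i)^2 with respect to theta
    return -2.0 * X.T @ ((y - p) * p * (1 - p)) / len(y)

def sample(n):
    X = rng.normal(size=(n, d))
    y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)
    return X, y

X_pop, y_pop = sample(200_000)                   # proxy for the population
thetas = [rng.normal(size=d) / np.sqrt(d) for _ in range(20)]

for n in [50, 200, 1000, 5000, 20000]:
    X, y = sample(n)
    gap = max(np.linalg.norm(grad_risk(t, X, y) - grad_risk(t, X_pop, y_pop))
              for t in thetas)
    print(f"n = {n:6d}   max gradient gap over test points = {gap:.4f}")
```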