Implicit Regularization in Matrix Factorization

Gunasekar, Suriya; Woodworth, Blake; Bhojanapalli, Srinadh; Neyshabur, Behnam; Srebro, Nathan

doi:10.1109/ita.2018.8503198

Cited by 206 publications

(352 citation statements)

References 8 publications

Supporting

Mentioning: 332

Contrasting


“…Nevertheless, as shown in Figure 4, we observe that increasing the number of parameters in fully connected two-layer neural networks leads to a risk curve qualitatively similar to that observed with RFF models. That the test risk improves beyond the interpolation threshold is compatible with the conjectured "small norm" inductive biases of the common training algorithms for neural networks [20, 25]. We note that this transition from under- to over-parameterized regimes for neural networks was also previously observed by [7, 1, 27, 37].…”

Section: Neural Network (supporting)

confidence: 85%

“…When all but the final layer of the network are fixed (as in RFF models), SGD initialized at zero also converges to the minimum norm solution. While the behavior of SGD for more general neural networks is not fully understood, there is significant empirical and some theoretical evidence (e.g., [20]) that a similar minimum norm inductive bias is present. Yet another type of inductive bias related to averaging is used in random forests.…”

Section: Concluding Thoughts (mentioning)

confidence: 99%
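The minimum-norm claim in the statement above can be checked numerically in its simplest form. The sketch below is an illustration only, not code from either paper: it swaps SGD for full-batch gradient descent on an underdetermined least-squares problem (a stand-in for training only the final layer), starts at zero, and compares the result with the pseudoinverse solution. The name `gd_least_squares`, the problem sizes, and the step size are assumptions chosen for the demo.

```python
import numpy as np

def gd_least_squares(A, b, steps=5000, lr=None):
    """Full-batch gradient descent on 0.5 * ||A w - b||^2, started at w = 0."""
    if lr is None:
        # Step size 1/L, where L is the squared spectral norm of A.
        lr = 1.0 / np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - b)
    return w

# Hypothetical toy problem: 5 equations, 20 unknowns, so many interpolants exist.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
b = rng.standard_normal(5)

w_gd = gd_least_squares(A, b)
w_min_norm = np.linalg.pinv(A) @ b  # minimum Euclidean norm solution of A w = b

print("fits the data:       ", np.allclose(A @ w_gd, b, atol=1e-6))
print("matches pinv answer: ", np.allclose(w_gd, w_min_norm, atol=1e-6))
```

Every gradient lies in the row space of `A`, and the iterate starts at zero, so the iterates never leave that subspace. The only interpolating solution inside the row space is the minimum-norm one, which is why the two answers agree in this linear setting.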


Reconciling modern machine-learning practice and the classical bias–variance trade-off

Belkin, Hsu, et al. 2019. Proc. Natl. Acad. Sci. U.S.A.


Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in the modern machine learning practice. The bias-variance trade-off implies that a model should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns. However, in the modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered over-fit, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.


“…Corollary. Let $x \in \mathbb{R}^{k_0}$ with $\|x\|_2 = 1$ and $f(z) = x u^T \phi(v x^T z)$, where $u, v \in \mathbb{R}^k$ and $\phi$ is a smooth elementwise nonlinearity with [...] We note that while the linear setting with $\phi(z) = z$ has been studied extensively using gradient flow (35-37), our results extend to the nonlinear setting.…”

Section: Proof That When Trained On a Single Example Overparameterized (mentioning)

confidence: 83%

Overparameterized neural networks implement associative memory

Radhakrishnan, Belkin, Uhler. 2020. Proc. Natl. Acad. Sci. U.S.A.


Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience. Our main finding is that standard overparameterized deep neural networks trained using standard optimization methods implement such a mechanism for real-valued data. We provide empirical evidence that 1) overparameterized autoencoders store training samples as attractors and thus iterating the learned map leads to sample recovery, and that 2) the same mechanism allows for encoding sequences of examples and serves as an even more efficient mechanism for memory than autoencoding. Theoretically, we prove that when trained on a single example, autoencoders store the example as an attractor. Lastly, by treating a sequence encoder as a composition of maps, we prove that sequence encoding provides a more efficient mechanism for memory than autoencoding.


“…Finally, we note that the notion of implicit regularization, broadly defined, arises in settings far beyond the models and algorithms considered herein. For instance, it has been conjectured that in matrix factorization, over-parameterized stochastic gradient descent effectively enforces certain norm constraints, allowing it to converge to a minimal-norm solution as long as it starts from the origin [52]. The stochastic gradient methods have also been shown to implicitly enforce Tikhonov regularization in several statistical learning settings [80].…”

Section: Related Workmentioning

confidence: 99%
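The conjecture summarized in the statement above can be explored with a small experiment. The following is a rough sketch under stated assumptions, not the setup of the cited work: it uses full-batch gradient descent rather than SGD, a symmetric factorization X = U Uᵀ with a full-size U, random symmetric Gaussian measurements, and a small multiple of the identity as the starting point (the exact origin is a stationary point of this factorized objective, so some nonzero start is needed). All names, sizes, and hyperparameters (`sensing_loss`, `gd_factorized`, `alpha`, `lr`, `steps`) are hypothetical choices for illustration.

```python
import numpy as np

def sensing_loss(U, A, y):
    """Return 0.5 * sum_i (<A_i, U U^T> - y_i)^2 and the residual vector."""
    X = U @ U.T
    r = np.einsum("kij,ij->k", A, X) - y
    return 0.5 * np.sum(r ** 2), r

def gd_factorized(A, y, n, alpha=1e-3, lr=0.01, steps=5000):
    """Gradient descent on the factor U, started at alpha * I (assumed init)."""
    U = alpha * np.eye(n)
    for _ in range(steps):
        _, r = sensing_loss(U, A, y)
        M = np.einsum("k,kij->ij", r, A)   # sum_i r_i A_i
        U = U - lr * 2.0 * M @ U           # gradient is 2 M U for symmetric A_i
    return U

# Hypothetical underdetermined problem: 30 measurements of a 10 x 10 matrix.
rng = np.random.default_rng(0)
n, m = 10, 30
G = rng.standard_normal((m, n, n))
A = (G + G.transpose(0, 2, 1)) / (2.0 * np.sqrt(m))  # symmetric, scaled
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
X_true = np.outer(u, u)                               # rank-1 PSD ground truth
y = np.einsum("kij,ij->k", A, X_true)

initial_loss, _ = sensing_loss(1e-3 * np.eye(n), A, y)
U = gd_factorized(A, y, n)
final_loss, _ = sensing_loss(U, A, y)
X_gd = U @ U.T

print("loss before / after:", initial_loss, final_loss)
print("nuclear norm of GD solution: ", np.linalg.norm(X_gd, "nuc"))
print("nuclear norm of ground truth:", np.linalg.norm(X_true, "nuc"))
```

Comparing the two printed nuclear norms, and repeating the run with a larger `alpha`, gives a feel for how the starting scale affects which interpolating solution gradient descent reaches. The sketch makes no claim about what the numbers will be for any particular seed.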

Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

et al. 2019


Recent years have seen a flurry of activities in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This "implicit regularization" feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e., phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control, measured entrywise and by the spectral norm, which might be of independent interest.
