2018 Information Theory and Applications Workshop (ITA)
DOI: 10.1109/ita.2018.8503198

Implicit Regularization in Matrix Factorization

Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix X with gradient descent on a factorization of X. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
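As a concrete (and purely illustrative) companion to the conjecture in the abstract, the following numpy sketch runs gradient descent on a full-dimensional factorization X = U U^T of an underdetermined matrix-sensing loss, starting near the origin with a small step size, and compares the nuclear norm of the result against a planted low-rank solution and against the minimum-Frobenius-norm interpolant. The sensing model, dimensions, step size, and iteration count are assumptions of mine, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 10, 2, 40        # matrix size, planted rank, measurements; m < d*(d+1)/2 = 55,
                           # so the quadratic objective is underdetermined.

# Planted low-rank PSD matrix and random symmetric sensing matrices A_k.
G = rng.standard_normal((d, r)) / np.sqrt(d)
X_star = G @ G.T
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('kij,ij->k', A, X_star)

# Gradient descent on the loss (1/2m) * sum_k (<A_k, U U^T> - y_k)^2
# over a full-dimensional factor U, started close to the origin.
U = 1e-3 * rng.standard_normal((d, d))
lr = 0.02
for _ in range(50_000):
    resid = np.einsum('kij,ij->k', A, U @ U.T) - y
    grad_X = np.einsum('k,kij->ij', resid, A) / m    # gradient w.r.t. X = U U^T
    U -= lr * 2 * grad_X @ U                         # chain rule (grad_X is symmetric)

X_gd = U @ U.T
X_fro = (np.linalg.pinv(A.reshape(m, d * d)) @ y).reshape(d, d)  # min-Frobenius interpolant
nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()

print("measurement residual of GD solution:", np.linalg.norm(np.einsum('kij,ij->k', A, X_gd) - y))
print("nuclear norm, GD on factorization  :", nuc(X_gd))
print("nuclear norm, planted X*           :", nuc(X_star))
print("nuclear norm, min-Frobenius interp.:", nuc(X_fro))
```

Under these assumptions, the factorized gradient-descent solution should fit the measurements while keeping its nuclear norm close to that of the planted matrix, in the spirit of the conjecture; the min-Frobenius interpolant is printed only as a contrasting baseline.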

Cited by 206 publications (352 citation statements)
References 8 publications

“…Nevertheless, as shown in Figure 4, we observe that increasing the number of parameters in fully connected two-layer neural networks leads to a risk curve qualitatively similar to that observed with RFF models. That the test risk improves beyond the interpolation threshold is compatible with the conjectured "small norm" inductive biases of the common training algorithms for neural networks [20, 25]. We note that this transition from under- to over-parameterized regimes for neural networks was also previously observed by [7, 1, 27, 37].…”
Section: Neural Networks (supporting)
confidence: 85%
“…When all but the final layer of the network are fixed (as in RFF models), SGD initialized at zero also converges to the minimum norm solution. While the behavior of SGD for more general neural networks is not fully understood, there is significant empirical and some theoretical evidence (e.g., [20]) that a similar minimum norm inductive bias is present. Yet another type of inductive bias related to averaging is used in random forests.…”
Section: Concluding Thoughts (mentioning)
confidence: 99%
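The minimum-norm claim for fixed feature maps has a simple concrete form, sketched below with plain numpy (my own illustration, not code from the cited papers), using full-batch gradient descent as a stand-in for SGD: with a fixed random-Fourier-style feature map, more features than samples, and zero initialization, every update lies in the row space of the feature matrix, so the iterates converge to the minimum-Euclidean-norm interpolant given by the pseudoinverse. The feature map, dimensions, and step-size rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_feat = 40, 5, 400              # samples, input dim, random features (d_feat >> n)

X = rng.standard_normal((n, d_in))
y = np.sin(X @ rng.standard_normal(d_in))  # arbitrary smooth target

# Fixed (untrained) random Fourier-style feature map; only the last layer is learned.
W = rng.standard_normal((d_in, d_feat))
b = rng.uniform(0.0, 2.0 * np.pi, d_feat)
Phi = np.cos(X @ W + b)

L = np.linalg.norm(Phi, 2) ** 2 / n        # smoothness constant of the quadratic loss
w = np.zeros(d_feat)                       # zero initialization
for _ in range(30_000):
    w -= (1.0 / L) * Phi.T @ (Phi @ w - y) / n   # full-batch gradient step

w_min = np.linalg.pinv(Phi) @ y            # minimum-norm interpolating solution
print("training residual       :", np.linalg.norm(Phi @ w - y))
print("distance to min-norm w  :", np.linalg.norm(w - w_min))
print("||w_gd||, ||w_min_norm||:", np.linalg.norm(w), np.linalg.norm(w_min))
```

Because the iterate never leaves the span of the rows of Phi, the only interpolant it can reach is the pseudoinverse (minimum-norm) one, which is the inductive bias the quoted passage refers to.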
“…Corollary. Let x ∈ R^{k_0} with ‖x‖_2 = 1 and f(z) = x u^T φ(v x^T z), where u, v ∈ R^k and φ is a smooth elementwise nonlinearity with … We note that while the linear setting with φ(z) = z has been studied extensively using gradient flow (35)(36)(37), our results extend to the nonlinear setting.…”
Section: Proof That When Trained On a Single Example Overparameterized (mentioning)
confidence: 83%
“…Finally, we note that the notion of implicit regularization-broadly defined-arises in settings far beyond the models and algorithms considered herein. For instance, it has been conjectured that in matrix factorization, over-parameterized stochastic gradient descent effectively enforces certain norm constraints, allowing it to converge to a minimal-norm solution as long as it starts from the origin [52]. The stochastic gradient methods have also been shown to implicitly enforce Tikhonov regularization in several statistical learning settings [80].…”
Section: Related Work (mentioning)
confidence: 99%
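The Tikhonov remark can be made concrete through a standard heuristic correspondence (my own sketch, not the analysis of [80]): on a least-squares objective, gradient descent run for t steps with step size η from zero behaves roughly like ridge regression with penalty λ ≈ 1/(ηt), so early stopping traces out an implicit regularization path.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

Sigma = X.T @ X / n
b = X.T @ y / n

eta = 0.1                                  # step size; spectrum of Sigma is O(1) here
w = np.zeros(d)
for t in range(1, 1001):
    w -= eta * (Sigma @ w - b)             # gradient step on (1/2n)||Xw - y||^2
    if t in (10, 100, 1000):
        lam = 1.0 / (eta * t)              # heuristically matched ridge strength
        w_ridge = np.linalg.solve(Sigma + lam * np.eye(d), b)
        gap = np.linalg.norm(w - w_ridge) / np.linalg.norm(w_ridge)
        print(f"t={t:4d}  lambda≈{lam:.3f}  relative gap to ridge solution: {gap:.3f}")
```

The match is only approximate (the two methods apply different spectral filters to the data), but it illustrates the sense in which iterative methods can stand in for an explicit Tikhonov penalty.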