2020
DOI: 10.48550/arxiv.2002.03305
Preprint

Momentum Improves Normalized SGD

Abstract: We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $\epsilon$-critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method th…
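The update the abstract refers to is simple: keep an exponential moving average of stochastic gradients and move along its direction only, discarding the magnitude. Below is a minimal sketch of that idea on a toy non-convex objective, assuming a plain exponential-moving-average momentum; the function names, step size, and toy objective are illustrative and not taken from the paper.

```python
import numpy as np

def stochastic_grad(x, rng, noise_scale=0.1):
    """Noisy gradient of the toy non-convex objective f(x) = sum(x_i^2 / (1 + x_i^2))."""
    true_grad = 2 * x / (1 + x**2) ** 2
    return true_grad + noise_scale * rng.standard_normal(x.shape)

def normalized_sgd_momentum(x0, lr=0.05, beta=0.9, steps=500, seed=0):
    """Normalized SGD with momentum: average stochastic gradients, then take a unit-norm step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = stochastic_grad(x, rng)
        m = beta * m + (1.0 - beta) * g                 # exponential moving average of gradients
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)    # step along the direction of m only
    return x

if __name__ == "__main__":
    print("final iterate:", normalized_sgd_momentum(x0=[2.0, -3.0]))
```

Informally, the averaging plays the role that a large batch plays in earlier analyses of normalized SGD: with beta close to 1 the momentum vector concentrates around the true gradient, which is the effect the abstract credits for removing the large-batch requirement.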

Cited by 5 publications (17 citation statements)
References 11 publications
“…Since training modern networks is quite expensive (Raffel et al., 2020; Brown et al., 2020), it is important to be confident that each and every training run will produce a high-quality result. We provide a variant of the normalized SGD algorithm of Cutkosky & Mehta (2020) that incorporates gradient clipping and show that the method finds an $\epsilon$-critical point in $\tilde{O}\big(\epsilon^{-\frac{3p-2}{p-1}}\big)$ iterations with high probability. Second, the vast majority of first-order optimization methods for stochastic non-convex optimization consider exclusively the standard $L_2$ or Hilbert-space norm.…”
Section: SGD and Heavy Tails (mentioning)
confidence: 99%
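The statement above describes a clipped variant: each stochastic gradient is truncated to a bounded norm before it enters the momentum average, which tames heavy-tailed noise. The following is a minimal sketch of that idea under those assumptions; the function `clipped_normalized_sgd`, its hyperparameters, and the Student-t noise model are illustrative and are not taken from the cited work.

```python
import numpy as np

def clipped_normalized_sgd(x0, grad_fn, lr=0.05, beta=0.9, clip=1.0, steps=500, seed=0):
    """Sketch of normalized SGD with momentum where every stochastic gradient is
    clipped to norm at most `clip` before being averaged into the momentum buffer."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)
        g_norm = np.linalg.norm(g)
        if g_norm > clip:                               # truncate rare, very large gradient samples
            g = g * (clip / g_norm)
        m = beta * m + (1.0 - beta) * g                 # momentum average of clipped gradients
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)    # normalized (unit-direction) step
    return x

if __name__ == "__main__":
    # Toy quadratic with heavy-tailed (Student-t) gradient noise.
    def noisy_grad(x, rng):
        return 2 * x + rng.standard_t(df=2.5, size=x.shape)

    print(clipped_normalized_sgd(x0=[5.0, -5.0], grad_fn=noisy_grad))
```

Clipping bounds every individual sample, so a single heavy-tailed draw cannot dominate the momentum average, which is what makes high-probability guarantees of the kind quoted above plausible.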
“…Third, recent theoretical advances in non-convex optimization have produced a number of new algorithms that avoid the lower bounds of (Arjevani et al., 2019) by assuming extra structure on the objective F, such as second-order smoothness. In this case, (Tripuraneni et al., 2018; Allen-Zhu, 2018; Fang et al., 2019; Cutkosky & Mehta, 2020; Arjevani et al., 2020) provide algorithms that achieve faster convergence rates. In particular, the algorithms of (Fang et al., 2019; Cutkosky & Mehta, 2020) can find an $\epsilon$-critical point in $O(\epsilon^{-3.5})$ iterations using a first-order stochastic oracle, which is the optimal rate (Arjevani et al., 2020).…”
Section: SGD and Heavy Tails (mentioning)
confidence: 99%