2021
DOI: 10.1137/19m128908x
Making the Last Iterate of SGD Information Theoretically Optimal

Cited by 26 publications (48 citation statements)
References 7 publications
“…Nevertheless, their lower bound analysis relies on a construction with dimension d = T. Jain et al. [2019] used a sophisticated but non-standard step size schedule to achieve an optimal convergence rate for the final iterate of SGD.…”
Section: Related Work
confidence: 99%
“…There is a line of work attempting to understand the convergence rate of the final iterate of SGD. Shamir and Zhang [2013] first established a near-optimal O(log T/√T) convergence rate for the final iterate of SGD with the step size schedule η_t = 1/√t. Jain et al. [2019] proved an information-theoretically optimal O(1/√T) upper bound using a rather non-standard step size schedule. Harvey et al. [2019] gave an Ω(log T/√T) lower bound for the standard η_t = 1/√t step size schedule, but their construction requires the dimension d to equal T, which is quite restrictive.…”
Section: Introduction
confidence: 99%
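As a concrete illustration of the standard schedule discussed in this excerpt, below is a minimal sketch, not taken from any of the cited papers, of one-dimensional projected SGD with the η_t = c/√t step size on a toy quadratic objective; the objective, the constant c, the noise level, and the projection radius are all illustrative assumptions. It reports the last iterate alongside the uniform average for comparison.

```python
import numpy as np

# Illustrative sketch (toy 1-D objective f(x) = 0.5 * (x - 1)^2 with additive
# Gaussian gradient noise): projected SGD with the standard step size
# schedule eta_t = c / sqrt(t). All constants below are arbitrary choices.

def sgd_last_vs_average(T=10_000, c=1.0, noise_std=1.0, radius=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    running_sum = 0.0
    for t in range(1, T + 1):
        grad = (x - 1.0) + noise_std * rng.standard_normal()  # stochastic gradient
        x = x - (c / np.sqrt(t)) * grad                       # eta_t = c / sqrt(t)
        x = float(np.clip(x, -radius, radius))                # projection step
        running_sum += x
    return x, running_sum / T  # last iterate vs. uniform average

last, avg = sgd_last_vs_average()
print(f"last iterate:    {last:.4f}")
print(f"average iterate: {avg:.4f}")  # both should land near the minimizer x* = 1
```

This only illustrates the standard decaying schedule; the O(1/√T) last-iterate guarantee of Jain et al. [2019] mentioned above relies on a different, non-standard schedule that the sketch does not reproduce.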
“…The variance introduced by the stochastic nature of the SGD algorithm has become a central problem for modern optimization methods. This variance limits SGD to a sublinear convergence rate with a fixed step size [1]; the accuracy of a stochastic algorithm is tied to the sampling variance, and when the variance tends to 0, the error of the algorithm also tends to 0. In that case, SGD can still converge quickly even with a large step size.…”
Section: Introduction
confidence: 99%
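The role of gradient variance described in this excerpt can be seen on a toy problem. The sketch below is a hypothetical illustration, with an assumed quadratic objective, fixed step size, and Gaussian gradient noise: exact gradients (zero variance) converge fast at a large fixed step size, while noisy gradients of the same expectation stall at a noise floor.

```python
import numpy as np

# Illustrative sketch (toy quadratic f(x) = 0.5 * x^2, fixed step size):
# with exact gradients SGD converges linearly, while with noisy gradients of
# the same expectation it only reaches a floor set by the gradient variance.

def run_sgd(noise_std, eta=0.5, T=200, seed=0):
    rng = np.random.default_rng(seed)
    x = 10.0
    for _ in range(T):
        grad = x + noise_std * rng.standard_normal()  # E[grad] = f'(x) = x
        x -= eta * grad
    return abs(x)  # distance to the minimizer x* = 0

for sigma in (0.0, 0.1, 1.0):
    print(f"gradient noise std = {sigma}: final |x_T - x*| = {run_sgd(sigma):.2e}")
```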
“…Numerous strategies have been proposed for setting the step size, based either on theory [33,19,23,13] or on empirical evidence [11,25,24,31,39,41,28,47]. Generally speaking, if the gradient is subject to additive noise, a constant step size can promote fast convergence [18,38] to a ball around a stationary point.…”
Section: Introduction
confidence: 99%
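To illustrate the constant step-size behaviour described in this excerpt, the sketch below, again on an assumed toy quadratic with Gaussian gradient noise (none of these choices come from the cited works), measures the average distance of the late iterates from the minimizer for several fixed step sizes; the radius of the resulting "ball" shrinks as the step size is reduced.

```python
import numpy as np

# Illustrative sketch (toy quadratic f(x) = 0.5 * x^2, Gaussian gradient
# noise): with a constant step size the iterates settle quickly into a
# neighbourhood of the minimizer whose radius scales with the step size.

def constant_step_sgd(eta, T=5_000, noise_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 10.0  # start far from the minimizer x* = 0
    tail = []
    for t in range(T):
        grad = x + noise_std * rng.standard_normal()  # noisy gradient of f
        x -= eta * grad
        if t >= T - 1000:
            tail.append(abs(x))  # distance to x* over the last 1000 steps
    return float(np.mean(tail))

for eta in (0.2, 0.05, 0.0125):
    print(f"eta = {eta:<7} mean |x_t - x*| near the end: {constant_step_sgd(eta):.3f}")
```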