2019
DOI: 10.48550/arxiv.1904.12838
Preprint

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Abstract: Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of…
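For concreteness, a minimal sketch of the two schedule families the abstract contrasts, polynomial decay versus geometric (step) decay; the decay exponent, factor, and interval below are illustrative choices, not values from the paper:

def polynomial_decay(eta0, t, alpha=0.5):
    # Polynomially decaying step size, e.g. eta_t = eta0 / (t + 1)^alpha.
    return eta0 / (t + 1) ** alpha

def step_decay(eta0, t, factor=0.5, interval=30):
    # Geometric "step decay": cut the step size by `factor` every `interval` iterations.
    return eta0 * factor ** (t // interval)

# Compare the two schedules at a few iterations (illustrative initial rate 0.1).
for t in (0, 30, 60, 90):
    print(t, round(polynomial_decay(0.1, t), 4), round(step_decay(0.1, t), 4))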

Cited by 18 publications (19 citation statements). References 24 publications.

“…3. The red line is an "adaptive" gradient descent method [20,21] whereby we start out with α = 1 but successively decrease it by a geometric factor β whenever an update of the parameters causes a decrease in the objective function. Note that there is still a hyperparameter here in choosing the size of the geometric decay factor β.…”
Section: Results
Mentioning (confidence: 99%)
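A minimal sketch of the annealing rule described in this excerpt, taken literally: start at α = 1 and shrink the step size by a geometric factor β whenever an update decreases the objective. The toy objective, the value of beta, and the function names are illustrative assumptions, not taken from the cited work:

import numpy as np

def adaptive_geometric_gd(f, grad, x0, beta=0.5, steps=100):
    # Gradient descent that starts at alpha = 1 and multiplies alpha by beta
    # whenever an update decreases the objective (rule as quoted above).
    x = np.asarray(x0, dtype=float)
    alpha = 1.0
    for _ in range(steps):
        x_new = x - alpha * grad(x)
        if f(x_new) < f(x):   # the update decreased the objective
            alpha *= beta     # anneal the step size geometrically
        x = x_new
    return x

# Toy usage on f(x) = ||x||^2 / 2 (hypothetical example).
x_star = adaptive_geometric_gd(f=lambda x: 0.5 * x @ x, grad=lambda x: x, x0=[1.0, -2.0])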
“…The bounded gradient is generally assumed in the non-convex/convex convergence analysis of SGD (Nesterov, 2003;Reddi et al, 2016). And the learning rate schedule is necessary for the analysis of SGD to decay its constant gradient variance (Ge et al, 2019).…”
Section: Training Dynamic Of GCN
Mentioning (confidence: 99%)
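For intuition behind this remark, a standard back-of-the-envelope bound (not taken from the cited analyses): for SGD with constant step size $\eta$ on a $\mu$-strongly convex, $L$-smooth objective with gradient-noise variance $\sigma^2$, one has roughly
\[
\mathbb{E}\bigl[f(x_t) - f^\star\bigr] \;\lesssim\; (1-\eta\mu)^{t}\,\bigl(f(x_0)-f^\star\bigr) \;+\; \frac{\eta L \sigma^{2}}{2\mu},
\]
so a constant step size leaves a noise floor proportional to $\eta\sigma^{2}$, and only a decaying schedule drives both terms to zero.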
“…We use a popular step decay learning rate schedule as our baseline [20], which is used in the open source ResNet implementation [19]. It contains three components: initial learning rate, discount step, and discount factor.…”
Section: Baseline Learning Rate Schedule
Mentioning (confidence: 99%)
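A minimal sketch of such a three-parameter schedule, assuming the rate is multiplied by the discount factor once every discount_step epochs (names and example values are illustrative, not taken from the cited implementation):

def step_decay_schedule(initial_lr, discount_step, discount_factor):
    # Map an epoch index to a learning rate: start at initial_lr and apply
    # the discount factor once per discount_step epochs.
    return lambda epoch: initial_lr * discount_factor ** (epoch // discount_step)

lr_at = step_decay_schedule(initial_lr=0.1, discount_step=20, discount_factor=0.9)
print([round(lr_at(e), 4) for e in (0, 19, 20, 40)])  # -> [0.1, 0.1, 0.09, 0.081]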
“…The baseline schedule starts from the initial learning rate, then it decreases by the discount factor every discount steps. In the baseline experiments, we test all combinations from the initial learning rate in [0.1, 0.01, 0.001, 0.0001], the discount step in [10,20,50,100], and the discount factor in [0.99, 0.9, 0.88]. After choosing the best baseline schedule, we run it 10 times with the same set of hyper-parameters and report mean and standard deviation of test loss and accuracy.…”
Section: Baseline Learning Rate Schedule
Mentioning (confidence: 99%)
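A sketch of how the grid described above could be enumerated; evaluate is a hypothetical user-supplied function (not part of the cited work) that trains the model with a given schedule and returns its test loss:

from itertools import product

# Hyperparameter grid quoted above: 4 * 4 * 3 = 48 baseline combinations.
INITIAL_LRS      = [0.1, 0.01, 0.001, 0.0001]
DISCOUNT_STEPS   = [10, 20, 50, 100]
DISCOUNT_FACTORS = [0.99, 0.9, 0.88]

def best_baseline(evaluate):
    # evaluate(initial_lr, discount_step, discount_factor) -> test loss
    # (hypothetical interface); return the combination with the lowest loss.
    return min(product(INITIAL_LRS, DISCOUNT_STEPS, DISCOUNT_FACTORS),
               key=lambda cfg: evaluate(*cfg))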