2019
DOI: 10.48550/arxiv.1908.03265
Preprint

On the Variance of the Adaptive Learning Rate and Beyond

Abstract: The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence …
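As a concrete illustration of the heuristic the abstract discusses, here is a minimal sketch of linear learning-rate warmup applied on top of Adam in PyTorch. The warmup length, base learning rate, model, and loss are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: linear learning-rate warmup on top of Adam.
# Hyperparameters and model are placeholders, not the paper's configuration.
import torch

model = torch.nn.Linear(10, 1)                 # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 1000                            # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

for step in range(2000):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()   # scales the learning rate linearly during warmup
```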

Cited by 309 publications (335 citation statements)
References 15 publications
“…The above results can be extended to include the presence of a state-dependent, time-varying learning rate [Zeiler, 2012, Kingma and Ba, 2014, Goodfellow et al., 2016, Liu et al., 2019].…”
Section: Adaptive Learning Rates
confidence: 99%
“…which is used as the target for optimisation [9]. Optimisation was performed using the Rectified ADAM (RADAM) algorithm [30,31] combined with early stopping on a separate validation data split. The parameters of the Rectified ADAM algorithm were varied depending on the input-output subset based on experiments with the validation set.…”
Section: Neural Network Optimisation
confidence: 99%
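The citation above describes optimisation with Rectified Adam plus early stopping on a held-out validation split. Below is a minimal sketch of that pattern, assuming PyTorch's built-in torch.optim.RAdam (available in recent releases); the model, data, and patience value are placeholders, not details from the cited work.

```python
# Sketch: RAdam optimisation with early stopping on a validation split.
# Model, data, and patience are illustrative placeholders.
import torch

model = torch.nn.Linear(20, 1)
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x_train, y_train = torch.randn(256, 20), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 20), torch.randn(64, 1)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # stop once validation loss stops improving
            break
```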
“…For E2, we modify the style block of the original backbone to output 512×8×8 tensors instead of 512 vectors. In Phase I, we follow the previous encoder-based methods [3,38,45] and use the Ranger optimizer, which combines the Lookahead [55] and the Rectified Adam [31] optimizers, for training. In Phase II, we use Adam [28] with standard settings, which we found to make the training of the hypernetworks converge faster.…”
Section: Experiments, 4.1 Experimental Settings
confidence: 99%
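Since the citation above mentions the Ranger optimizer (Lookahead wrapped around Rectified Adam), the following is a rough sketch of how that combination can be wired up. The Lookahead wrapper here is a simplified, hand-rolled illustration of the slow/fast weight interpolation, not the implementation used in the cited work, and the k and alpha values are assumptions.

```python
# Rough sketch of Ranger-style optimisation: a simplified Lookahead wrapper
# around torch.optim.RAdam. k and alpha are assumed defaults for illustration.
import torch

class SimpleLookahead:
    """Every k steps, pull slow weights toward fast weights by factor alpha."""

    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base = base_optimizer
        self.k, self.alpha, self.counter = k, alpha, 0
        # snapshot of the "slow" weights
        self.slow = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        self.base.step()
        self.counter += 1
        if self.counter % self.k == 0:
            for group, slow_group in zip(self.base.param_groups, self.slow):
                for p, slow_p in zip(group["params"], slow_group):
                    # slow <- slow + alpha * (fast - slow); fast <- slow
                    slow_p.add_(p.detach() - slow_p, alpha=self.alpha)
                    p.data.copy_(slow_p)

model = torch.nn.Linear(8, 1)
optimizer = SimpleLookahead(torch.optim.RAdam(model.parameters(), lr=1e-3))

for _ in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 8)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
```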