2018
DOI: 10.48550/arxiv.1806.06763
Preprint

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Abstract: Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves closing the generalization gap of adaptive gradient methods as an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method (Padam)…
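As a rough illustration of the kind of update the abstract describes (this is a sketch based on the paper's description of partially adaptive momentum estimation, not the authors' reference code; the parameter names and the eps safeguard below are assumptions): Padam keeps Adam/Amsgrad-style moment estimates but raises the second-moment term to a partially adaptive exponent p in [0, 1/2].

```python
import numpy as np

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               partial=0.125, eps=1e-8):
    """One illustrative Padam-style update (sketch, not the official code).

    `partial` is the partially adaptive exponent p in [0, 1/2]:
    p = 1/2 gives an Amsgrad-like fully adaptive step, while p -> 0
    approaches SGD with momentum.
    """
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    v_hat = np.maximum(v_hat, v)                  # Amsgrad-style monotone maximum
    theta = theta - lr * m / (v_hat ** partial + eps)  # partially adaptive step
    return theta, (m, v, v_hat)
```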

Cited by 50 publications (77 citation statements)
References 18 publications
“…From Table 1, we find that Padam [5] reached the best F1 score on the test data at epoch 14. As shown in Figure 1, the Adam [4] and AdaBelief [1] optimizers begin to show signs of overfitting at epoch 3: their training losses flatten while their test losses start to fluctuate.…”
Section: AlexNet (mentioning)
confidence: 97%
“…Padam (partially adaptive momentum estimation) [5] is a modified version of Adam. It tries to close the generalization gap of adaptive gradient methods by introducing a partially adaptive parameter, which also resolves the "small learning rate dilemma" (the initial learning rate for adaptive methods is often small) and allows for faster convergence [5]. Padam is shown empirically to achieve the fastest convergence speed while generalizing as well as SGD with momentum [5].…”
Section: Padam (mentioning)
confidence: 99%
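One way to see the "small learning rate dilemma" mentioned in the quote above: with a fully adaptive exponent (p = 1/2), coordinates with a small second-moment estimate have their steps inflated by 1/v̂^p, so the base learning rate must be kept small; a partial exponent damps this inflation, which is what lets Padam use SGD-like base rates. A tiny self-contained illustration (the value of v̂ below is made up):

```python
# Effect of the partially adaptive exponent p on the step scaling 1 / v_hat**p
# for a coordinate with a small second-moment estimate (illustrative value only).
v_hat = 1e-4
for p in (0.5, 0.25, 0.125, 0.0):
    print(f"p = {p:5.3f} -> step scale 1/v_hat**p = {1.0 / v_hat ** p:8.2f}")
# p = 0.500 -> 100.00  (fully adaptive: forces a small base learning rate)
# p = 0.250 ->  10.00
# p = 0.125 ->   3.16  (Padam-style partial adaptivity)
# p = 0.000 ->   1.00  (SGD with momentum: no adaptive rescaling)
```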
“…In the non-convex and smooth setting, [Ward et al, 2020] and [Li and Orabona, 2018] prove that the "norm" version of AdaGrad converges to a stationary point at rate O(1/ε²) for stochastic GD and at rate O(1/ε) for batch GD. Many modifications to AdaGrad have been proposed, namely RMSprop [Hinton et al, 2012], AdaDelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], AdaFTRL [Orabona and Pál, 2015], SGD-BB [Tan et al, 2016], AcceleGrad [Levy et al, 2018], Yogi [Zaheer et al, 2018a], Padam [Chen and Gu, 2018], to name a few. More recently, accelerated adaptive gradient methods have also been proven to converge to stationary points [Barakat and Bianchi, 2018, Zaheer et al, 2018b, Zhou et al, 2018, Zou et al, 2018b].…”
Section: Introduction (mentioning)
confidence: 99%
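The "norm" version of AdaGrad referred to in the quote above adapts a single scalar step size from the accumulated squared gradient norms, rather than keeping per-coordinate statistics. A minimal sketch of one such step, under my own naming assumptions (not code from Ward et al. or Li and Orabona):

```python
import numpy as np

def adagrad_norm_step(theta, grad, b_sq, lr=1.0, eps=1e-8):
    """One step of a "norm"-style AdaGrad update (illustrative sketch)."""
    b_sq = b_sq + float(np.dot(grad, grad))            # accumulate ||g_t||^2
    theta = theta - lr * grad / (np.sqrt(b_sq) + eps)  # single scalar scaling
    return theta, b_sq
```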
“…Top-1 accuracy of ResNet18 on ImageNet (69.76 is reported in the PyTorch Documentation, † is reported in [30], * is reported in [17], ‡ is reported in [18]):
SGD 69.76 (70.23 †) | Adam 66.54 * | AdamW 67.93 † | RAdam 67.62 * | AdaShift 65.28 | AdaBelief 70.08 ‡ | ACProp 70.46…”
(mentioning)
confidence: 99%