2018
DOI: 10.48550/arxiv.1806.06763
Preprint

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Abstract: Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves closing the generalization gap of adaptive gradient methods as an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method (Padam)…
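As a rough illustration of the kind of update the abstract describes (this is a sketch based on the paper's description of partially adaptive momentum estimation, not the authors' reference code; the parameter names and the eps safeguard below are assumptions): Padam keeps Adam/Amsgrad-style moment estimates but raises the second-moment term to a partially adaptive exponent p in [0, 1/2].

```python
import numpy as np

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               partial=0.125, eps=1e-8):
    """One illustrative Padam-style update (sketch, not the official code).

    `partial` is the partially adaptive exponent p in [0, 1/2]:
    p = 1/2 gives an Amsgrad-like fully adaptive step, while p -> 0
    approaches SGD with momentum.
    """
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    v_hat = np.maximum(v_hat, v)                  # Amsgrad-style monotone maximum
    theta = theta - lr * m / (v_hat ** partial + eps)  # partially adaptive step
    return theta, (m, v, v_hat)
```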

Cited by 50 publications (77 citation statements)
References 18 publications
“…From Table 1, we find that Padam [5] reached the best F1 score on the test data at epoch 14. As shown in Figure 1, the Adam [4] and AdaBelief [1] optimizers begin to show signs of overfitting at epoch 3: their training losses flatten while their test losses start to fluctuate.…”
Section: AlexNet (mentioning)
confidence: 97%
“…Padam (partially adaptive momentum estimation) [5] is a modified version of Adam. It tries to close the generalization gap of adaptive gradient methods by introducing a partially adaptive parameter, which also resolves the "small learning rate dilemma" (the initial learning rate for adaptive methods is often small) and allows for faster convergence [5]. Padam is shown empirically to achieve the fastest convergence speed while generalizing as well as SGD with momentum [5].…”
Section: Padam (mentioning)
confidence: 99%
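One way to see the "small learning rate dilemma" mentioned in the quote above: with a fully adaptive exponent (p = 1/2), coordinates with a small second-moment estimate have their steps inflated by 1/v̂^p, so the base learning rate must be kept small; a partial exponent damps this inflation, which is what lets Padam use SGD-like base rates. A tiny self-contained illustration (the value of v̂ below is made up):

```python
# Effect of the partially adaptive exponent p on the step scaling 1 / v_hat**p
# for a coordinate with a small second-moment estimate (illustrative value only).
v_hat = 1e-4
for p in (0.5, 0.25, 0.125, 0.0):
    print(f"p = {p:5.3f} -> step scale 1/v_hat**p = {1.0 / v_hat ** p:8.2f}")
# p = 0.500 -> 100.00  (fully adaptive: forces a small base learning rate)
# p = 0.250 ->  10.00
# p = 0.125 ->   3.16  (Padam-style partial adaptivity)
# p = 0.000 ->   1.00  (SGD with momentum: no adaptive rescaling)
```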
“…In the non-convex and smooth setting, [Ward et al, 2020] and [Li and Orabona, 2018] prove that the "norm" version of AdaGrad converges to a stationary point at rate O(1/ε²) for stochastic GD and at rate O(1/ε) for batch GD. Many modifications to AdaGrad have been proposed, namely RMSprop [Hinton et al, 2012], AdaDelta [Zeiler, 2012], Adam [Kingma and Ba, 2014], AdaFTRL [Orabona and Pál, 2015], SGD-BB [Tan et al, 2016], AcceleGrad [Levy et al, 2018], Yogi [Zaheer et al, 2018a], Padam [Chen and Gu, 2018], to name a few. More recently, accelerated adaptive gradient methods have also been proven to converge to stationary points [Barakat and Bianchi, 2018, Zaheer et al, 2018b, Zhou et al, 2018, Zou et al, 2018b].…”
Section: Introduction (mentioning)
confidence: 99%
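The "norm" version of AdaGrad referred to in the quote above adapts a single scalar step size from the accumulated squared gradient norms, rather than keeping per-coordinate statistics. A minimal sketch of one such step, under my own naming assumptions (not code from Ward et al. or Li and Orabona):

```python
import numpy as np

def adagrad_norm_step(theta, grad, b_sq, lr=1.0, eps=1e-8):
    """One step of a "norm"-style AdaGrad update (illustrative sketch)."""
    b_sq = b_sq + float(np.dot(grad, grad))            # accumulate ||g_t||^2
    theta = theta - lr * grad / (np.sqrt(b_sq) + eps)  # single scalar scaling
    return theta, b_sq
```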
“…Top-1 accuracy of ResNet18 on ImageNet (69.76 is reported in the PyTorch Documentation, † is reported in [30], * is reported in [17], ‡ is reported in [18]):
SGD 69.76 (70.23 †) | Adam 66.54 * | AdamW 67.93 † | RAdam 67.62 * | AdaShift 65.28 | AdaBelief 70.08 ‡ | ACProp 70.46…”
(mentioning)
confidence: 99%