Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Preprint, 2022
DOI: 10.48550/arxiv.2208.06677

Abstract: Adaptive gradient algorithms [1][2][3][4] borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory [5] and in many empirical cases [6], is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. […]
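
For readers who want the abstract's idea in concrete form, the following is a minimal NumPy sketch of a single Adan-style update step, reconstructed from the description above; the coefficient conventions, restart condition, and bias corrections of the paper's full algorithm are simplified here, so treat it as an illustration rather than the reference implementation.

    import numpy as np

    def adan_step(theta, g, g_prev, m, v, n, lr=1e-3,
                  beta1=0.02, beta2=0.08, beta3=0.01, eps=1e-8, wd=0.0):
        # Sketch of one Adan-style update: every statistic uses only the current
        # gradient g and the previous gradient g_prev, so no extra gradient at an
        # extrapolation point is needed.
        diff = g - g_prev
        m = (1 - beta1) * m + beta1 * g            # EMA of the gradient
        v = (1 - beta2) * v + beta2 * diff         # EMA of the gradient difference
        u = g + (1 - beta2) * diff                 # Nesterov-style corrected gradient
        n = (1 - beta3) * n + beta3 * u ** 2       # EMA of the squared corrected gradient
        step = lr * (m + (1 - beta2) * v) / (np.sqrt(n) + eps)
        theta = (theta - step) / (1 + lr * wd)     # decoupled weight decay
        return theta, m, v, n

None of the default values above are tuned settings; they only mirror the small-coefficient convention the paper uses for its moment estimates.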

Cited by 28 publications (38 citation statements) | References 35 publications

“…RAdam [1] tries to correct the adaptive learning rate to maintain a constant variance. Adamp [9] modifies the practical step sizes to prevent the weight standard from increasing, and Adan [10] introduces a Nesterov Momentum Estimation (NME) method to reduce training cost and improve performance.…”
Section: Publication Methods
mentioning, confidence: 99%
“…Batch Normalization [127] and its variants use the mean and variance of historical statistics computed through EMA to standardize the data. Besides, leveraging historical feature representations [107], [108], [115], network parameters [34]-[39], [60], and gradients [1], [9], [10] by EMA gives more weight and importance to the most recent data points while still tracking a portion of the history.…”
Section: Aspect Of Storage Form
mentioning, confidence: 99%
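
As a side note on the EMA bookkeeping this statement refers to, the mechanism is just a running, decay-weighted update of the tracked statistic; a minimal sketch follows (the 0.99 decay and the BatchNorm-style example are illustrative, not taken from any cited paper):

    def ema_update(running, current, decay=0.99):
        # Recent values get weight (1 - decay); older history decays geometrically.
        return decay * running + (1.0 - decay) * current

    # BatchNorm-style running statistics accumulated over mini-batches.
    running_mean, running_var = 0.0, 1.0
    for batch_mean, batch_var in [(0.3, 1.2), (0.1, 0.9), (0.4, 1.1)]:
        running_mean = ema_update(running_mean, batch_mean)
        running_var = ema_update(running_var, batch_var)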
“…In [69], the Adaptive Nesterov momentum algorithm is proposed to effectively accelerate the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation method, which reduces the extra computation and memory overhead of computing the gradient at the extrapolation point.…”
Section: Positive-Negative Momentum
mentioning, confidence: 99%
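
The contrast this statement draws, avoiding the gradient at the extrapolation point, can be made concrete with a small comparison; the vanilla variant below is standard Nesterov momentum, while the estimated variant is only a sketch of the reformulation described above and simplifies the paper's exact coefficients:

    # Vanilla Nesterov momentum: needs the gradient at the extrapolated point
    # theta + mu * m, i.e. an extra gradient evaluation (or a lookahead copy).
    def nesterov_step(theta, m, grad_fn, lr=0.1, mu=0.9):
        g_lookahead = grad_fn(theta + mu * m)
        m = mu * m - lr * g_lookahead
        return theta + m, m

    # Nesterov momentum estimation (sketch): reuses the current gradient and the
    # difference from the previous step, so only gradients at theta are needed.
    def nme_step(theta, m, g, g_prev, lr=0.1, mu=0.9):
        m = mu * m + (1 - mu) * (g + mu * (g - g_prev))
        return theta - lr * m, m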
“…2) Experimental Setting: We divide the UCM dataset into a training set and a testing set randomly according to a specific ratio (1:99, 1:9, 3:7, 8:2). We use the Adan optimizer [75] with a cosine learning rate scheduler and train for 200 epochs. The results are evaluated for each backbone on an NVIDIA 3080 GPU using the THOP library.…”
Section: Scene Classification, 1) UC Merced Land Use Dataset
mentioning, confidence: 99%
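
A training setup of the kind sketched in this statement would look roughly like the following in PyTorch; AdamW is used purely as a stand-in optimizer because the released Adan implementation's exact constructor is not reproduced here, and the dummy data, model, and learning-rate values are illustrative only:

    import torch
    from torch.optim.lr_scheduler import CosineAnnealingLR

    # Dummy features/labels and a linear head stand in for the UCM backbone setup.
    x = torch.randn(64, 512)
    y = torch.randint(0, 21, (64,))
    model = torch.nn.Linear(512, 21)

    # Swap this stand-in for the released Adan optimizer in the actual setting.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.02)
    scheduler = CosineAnnealingLR(optimizer, T_max=200)  # cosine schedule over 200 epochs

    for epoch in range(200):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # one scheduler step per epoch

FLOPs and parameter counts like those reported in the statement are typically measured afterwards with the THOP library's profile function on a single dummy input.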