Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Preprint, 2022
DOI: 10.48550/arxiv.2208.06677

Abstract: Adaptive gradient algorithms [1][2][3][4] borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory [5] and in many empirical cases [6], is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. […]
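
For readers who want the abstract's idea in concrete form, the following is a minimal NumPy sketch of a single Adan-style update step, reconstructed from the description above; the coefficient conventions, restart condition, and bias corrections of the paper's full algorithm are simplified here, so treat it as an illustration rather than the reference implementation.

    import numpy as np

    def adan_step(theta, g, g_prev, m, v, n, lr=1e-3,
                  beta1=0.02, beta2=0.08, beta3=0.01, eps=1e-8, wd=0.0):
        # Sketch of one Adan-style update: every statistic uses only the current
        # gradient g and the previous gradient g_prev, so no extra gradient at an
        # extrapolation point is needed.
        diff = g - g_prev
        m = (1 - beta1) * m + beta1 * g            # EMA of the gradient
        v = (1 - beta2) * v + beta2 * diff         # EMA of the gradient difference
        u = g + (1 - beta2) * diff                 # Nesterov-style corrected gradient
        n = (1 - beta3) * n + beta3 * u ** 2       # EMA of the squared corrected gradient
        step = lr * (m + (1 - beta2) * v) / (np.sqrt(n) + eps)
        theta = (theta - step) / (1 + lr * wd)     # decoupled weight decay
        return theta, m, v, n

None of the default values above are tuned settings; they only mirror the small-coefficient convention the paper uses for its moment estimates.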

Cited by 28 publications (38 citation statements) | References 35 publications

“…RAdam [1] tries to correct the adaptive learning rate to maintain a constant variance. Adamp [9] modifies the practical step sizes to prevent the weight standard from increasing, and Adan [10] introduces a Nesterov Momentum Estimation (NME) method to reduce training cost and improve performance.…”
Section: Publication Methods
mentioning, confidence: 99%
“…Batch Normalization [127] and its variants use the mean and variance of historical statistics computed through EMA to standardize the data. Besides, leveraging historical feature representations [107], [108], [115], network parameters [34]-[39], [60], and gradients [1], [9], [10] by EMA gives more weight and importance to the most recent data points while still tracking a portion of the history.…”
Section: Aspect Of Storage Form
mentioning, confidence: 99%
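
As a side note on the EMA bookkeeping this statement refers to, the mechanism is just a running, decay-weighted update of the tracked statistic; a minimal sketch follows (the 0.99 decay and the BatchNorm-style example are illustrative, not taken from any cited paper):

    def ema_update(running, current, decay=0.99):
        # Recent values get weight (1 - decay); older history decays geometrically.
        return decay * running + (1.0 - decay) * current

    # BatchNorm-style running statistics accumulated over mini-batches.
    running_mean, running_var = 0.0, 1.0
    for batch_mean, batch_var in [(0.3, 1.2), (0.1, 0.9), (0.4, 1.1)]:
        running_mean = ema_update(running_mean, batch_mean)
        running_var = ema_update(running_var, batch_var)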
“…In [69], the Adaptive Nesterov momentum algorithm is proposed to effectively accelerate the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation method, which reduces the extra computation and memory overhead of computing the gradient at the extrapolation point.…”
Section: Positive-Negative Momentum
mentioning, confidence: 99%
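
The contrast this statement draws, avoiding the gradient at the extrapolation point, can be made concrete with a small comparison; the vanilla variant below is standard Nesterov momentum, while the estimated variant is only a sketch of the reformulation described above and simplifies the paper's exact coefficients:

    # Vanilla Nesterov momentum: needs the gradient at the extrapolated point
    # theta + mu * m, i.e. an extra gradient evaluation (or a lookahead copy).
    def nesterov_step(theta, m, grad_fn, lr=0.1, mu=0.9):
        g_lookahead = grad_fn(theta + mu * m)
        m = mu * m - lr * g_lookahead
        return theta + m, m

    # Nesterov momentum estimation (sketch): reuses the current gradient and the
    # difference from the previous step, so only gradients at theta are needed.
    def nme_step(theta, m, g, g_prev, lr=0.1, mu=0.9):
        m = mu * m + (1 - mu) * (g + mu * (g - g_prev))
        return theta - lr * m, m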
“…2) Experimental Setting: We divide the UCM dataset into a training set and a testing set randomly according to a specific ratio (1:99, 1:9, 3:7, 8:2). We use the Adan optimizer [75] with a cosine learning rate scheduler and train for 200 epochs. The results are evaluated for each backbone on an NVIDIA 3080 GPU using the THOP library.…”
Section: Scene Classification, 1) UC Merced Land Use Dataset
mentioning, confidence: 99%
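
A training setup of the kind sketched in this statement would look roughly like the following in PyTorch; AdamW is used purely as a stand-in optimizer because the released Adan implementation's exact constructor is not reproduced here, and the dummy data, model, and learning-rate values are illustrative only:

    import torch
    from torch.optim.lr_scheduler import CosineAnnealingLR

    # Dummy features/labels and a linear head stand in for the UCM backbone setup.
    x = torch.randn(64, 512)
    y = torch.randint(0, 21, (64,))
    model = torch.nn.Linear(512, 21)

    # Swap this stand-in for the released Adan optimizer in the actual setting.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.02)
    scheduler = CosineAnnealingLR(optimizer, T_max=200)  # cosine schedule over 200 epochs

    for epoch in range(200):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # one scheduler step per epoch

FLOPs and parameter counts like those reported in the statement are typically measured afterwards with the THOP library's profile function on a single dummy input.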