2019
DOI: 10.48550/arxiv.1905.02957
Preprint

SAdam: A Variant of Adam for Strongly Convex Functions

Abstract: The Adam algorithm has become extremely popular for large-scale machine learning. Under the convexity condition, it has been proved to enjoy a data-dependent O(√T) regret bound, where T is the time horizon. However, whether strong convexity can be utilized to further improve the performance remains an open problem. In this paper, we give an affirmative answer by developing a variant of Adam (referred to as SAdam) which achieves a data-dependent O(log T) regret bound for strongly convex functions. The essential …
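For context on the update the abstract refers to, the snippet below is a minimal sketch of a standard Adam step with bias-corrected moment estimates. It is a reference illustration only, not the SAdam update itself, whose strongly convex modification is not spelled out in the truncated abstract; the function name adam_step and the toy quadratic objective in the usage lines are assumptions made here.

    import numpy as np

    def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One step of the standard Adam update (reference sketch, not SAdam)."""
        t = state["t"] + 1
        m = beta1 * state["m"] + (1 - beta1) * grad        # first moment (EMA of gradients)
        v = beta2 * state["v"] + (1 - beta2) * grad ** 2    # second moment (EMA of squared gradients)
        m_hat = m / (1 - beta1 ** t)                        # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, {"m": m, "v": v, "t": t}

    # Hypothetical usage on the toy strongly convex objective f(x) = ||x||^2 / 2.
    theta = np.ones(3)
    state = {"m": np.zeros(3), "v": np.zeros(3), "t": 0}
    for _ in range(1000):
        grad = theta                                         # gradient of ||x||^2 / 2 is x
        theta, state = adam_step(theta, grad, state, lr=0.05)
    print(theta)                                             # should approach the minimizer 0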

Cited by 4 publications (5 citation statements). References 8 publications.
“…In the proposed DiffGrad optimization method, the steps up to the computation of the bias-corrected first-order moment m_t and the bias-corrected second-order moment v_t are the same as those of Adam optimization [38]. The DiffGrad optimization method updates θ_{t+1} using the following update rule:…”
Section: Adam-type Algorithms (mentioning)
confidence: 99%
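The excerpt above notes that Adam-type methods share the bias-corrected moment computation and differ only in the final update rule, which is elided in the quote. The sketch below is a hypothetical illustration of that structure, not the DiffGrad implementation: it factors out the shared moment computation and takes the variant-specific rule as a callback; the names adam_type_step and update_rule are assumptions made here.

    import numpy as np

    def adam_type_step(theta, grad, state, update_rule, beta1=0.9, beta2=0.999):
        """Shared skeleton of Adam-type methods: moments as in Adam, update rule plugged in."""
        t = state["t"] + 1
        m = beta1 * state["m"] + (1 - beta1) * grad
        v = beta2 * state["v"] + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
        theta = update_rule(theta, grad, m_hat, v_hat)
        return theta, {"m": m, "v": v, "t": t}

    # Plain Adam is recovered by plugging in its own rule; a variant would supply a different one.
    def adam_rule(theta, grad, m_hat, v_hat, lr=1e-3, eps=1e-8):
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    # theta, state = adam_type_step(theta, grad, state, adam_rule)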
“…Reddi et al. (2019) spot an issue with Adam's convergence and provide a variant called AMSGrad, while Zaheer et al. (2018) argue that Adam only converges with large batch sizes. Subsequently, other variants of Adam have been proposed (Luo et al., 2019; Chen et al., 2019b; Huang et al., 2018; Wang et al., 2019b). Multiple lines of theoretical study on Adam are given in (Fang & Klabjan, 2019; Alacaoglu et al., 2020; Défossez et al., 2020)…”
Section: Related Work (mentioning)
confidence: 99%
“…Besides the aforementioned, other variants of Adam include NosAdam [40], SAdam [41], AdaX [42], AdaBound [15] and Yogi [43]. ACProp could be combined with other techniques such as SWATS [44], LookAhead [45] and norm regularization similar to AdamP [46].…”
Section: Related Work (mentioning)
confidence: 99%