2021
DOI: 10.48550/arxiv.2110.09057
Preprint

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

Abstract: Heavy ball momentum is crucial in accelerating (stochastic) gradient-based optimization algorithms for machine learning. Existing heavy ball momentum is usually weighted by a uniform hyperparameter, which relies on excessive tuning. Moreover, the calibrated fixed hyperparameter may not lead to optimal performance. In this paper, to eliminate the effort for tuning the momentum-related hyperparameter, we propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optim…
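To make the idea in the abstract concrete, the sketch below shows plain heavy ball momentum with an illustrative adaptive choice of the momentum parameter, derived from the classical optimal value for quadratics. The function name `heavy_ball_adaptive` and the curvature-estimation rule are assumptions for illustration only, not the paper's exact algorithm.

```python
import numpy as np

def heavy_ball_adaptive(grad, x0, eta=0.01, n_iters=500, eps=1e-12):
    """Gradient descent with heavy ball momentum and an illustrative
    adaptive momentum parameter (hypothetical rule, not the paper's).

    The momentum beta follows the classical optimal value for quadratics,
        beta* = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**2,
    using crude running estimates of the curvature bounds L and mu from
    secant ratios ||g_k - g_{k-1}|| / ||x_k - x_{k-1}||.
    """
    x_prev, x = x0.copy(), x0.copy()
    g_prev = None
    L_hat, mu_hat = None, None
    beta = 0.0
    for k in range(n_iters):
        g = grad(x)
        if k > 0:
            s, y = x - x_prev, g - g_prev
            curv = np.linalg.norm(y) / (np.linalg.norm(s) + eps)  # local curvature estimate
            L_hat = curv if L_hat is None else max(L_hat, curv)
            mu_hat = curv if mu_hat is None else min(mu_hat, curv)
            beta = ((np.sqrt(L_hat) - np.sqrt(mu_hat)) /
                    (np.sqrt(L_hat) + np.sqrt(mu_hat) + eps)) ** 2
        x_new = x - eta * g + beta * (x - x_prev)  # heavy ball update
        x_prev, x, g_prev = x, x_new, g
    return x

# Example: quadratic f(x) = 0.5 * x^T A x with a spread-out spectrum.
A = np.diag([1.0, 10.0, 100.0])
x_star = heavy_ball_adaptive(lambda x: A @ x, x0=np.ones(3), eta=0.01, n_iters=2000)
print(np.linalg.norm(x_star))  # should be close to 0
```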

Cited by 6 publications (14 citation statements) | References 31 publications
“…And 3) HBNODEs can learn long-term dependencies effectively, capturing intrinsic patterns from data. There are numerous avenues for future work, and two particularly interesting directions in our mind are 1) improving HBNODEs, particularly replacing the fine-tuned or learned damping parameter with an adaptive one motivated by certain optimization algorithms with adaptive momentum [55,54,57], and 2) applying HBNODE-based ROMs to model reduction arising from scientific challenges, especially when we do not have the ground truth governing equation of the dynamical systems.…”
Section: Discussion
confidence: 99%
“…These optimal hyperparameters require knowledge of the Lipschitz constant L and μ, which are generally inaccessible. Since the SHB method has produced great practical results, it has been studied by many researchers in both convex and nonconvex settings [30,47,46,44,45,42]. Besides, SHB can escape saddle points with a larger learning rate [43] and has successfully improved training speed and accuracy in image classification tasks with deep neural networks (DNNs) [6,18,48].…”
Section: Stochastic Heavy Ball and Adaptive Momentum
confidence: 99%
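For reference, the optimal hyperparameters mentioned in the statement above have a closed form for a strongly convex quadratic. The following minimal sketch (function name assumed for illustration, not from the cited papers) computes them from the curvature bounds L and μ, which is exactly the knowledge that is generally inaccessible in practice.

```python
import math

def optimal_heavy_ball(L, mu):
    """Classical optimal heavy-ball hyperparameters for a strongly convex
    quadratic whose Hessian eigenvalues lie in [mu, L] (Polyak, 1964):
        step size  eta*  = 4 / (sqrt(L) + sqrt(mu))**2
        momentum   beta* = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**2
    Both require knowing L and mu, which makes the tuning hard in practice.
    """
    sL, smu = math.sqrt(L), math.sqrt(mu)
    eta = 4.0 / (sL + smu) ** 2
    beta = ((sL - smu) / (sL + smu)) ** 2
    return eta, beta

# Example: condition number kappa = L / mu = 100.
eta, beta = optimal_heavy_ball(L=100.0, mu=1.0)
print(eta, beta)  # ~0.0331, ~0.669
```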
“…Finding the optimal hyperparameters and computing them directly before training begins is difficult and computationally expensive. To this end, an adaptive method has been developed for the SHB momentum that uses historical information [44]. To the best of our knowledge, there is no principled way to fine-tune an optimization method; it is thus natural to raise the question: can we establish a simple method for tuning the normalized SHB method and guarantee its convergence?…”
Section: Introduction
confidence: 99%
“…The learning rate plays an important role in neural network training [32,44,45]. If it is too large, training may fail to converge; if it is too small, training will converge slowly. Up to now, several works on learning rate strategies for neural network training have appeared, which can be summarized into the following categories.…”
Section: Related Work
confidence: 99%
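A toy numeric illustration of this trade-off (assumed setup, not taken from the cited work): plain gradient descent on a one-dimensional quadratic diverges when the learning rate exceeds 2/L and crawls when it is far below 1/L.

```python
# Gradient descent on f(x) = 0.5 * L * x**2 with L = 10:
# eta > 2/L diverges, eta near 1/L converges quickly, tiny eta is slow.
L = 10.0
grad = lambda x: L * x

for eta in (0.25, 0.1, 0.001):   # too large, well-chosen, too small
    x = 1.0
    for _ in range(100):
        x = x - eta * grad(x)
    print(f"eta={eta:<6} |x| after 100 steps: {abs(x):.3e}")
```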