Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/488

Heavy-ball Algorithms Always Escape Saddle Points

Abstract: Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been shown that many first-order methods always avoid saddle points when started from random initial points. In this paper, we answer a question: can nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Directly applying the existing proof technique to heavy-ball algorithms is hard because each iteration of the heavy-ball algorithm consists of current and …
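For reference, the classical heavy-ball update couples the current gradient step with the displacement between the two most recent iterates; with step size \alpha and momentum parameter \beta (standard notation, not taken from the truncated abstract above), one step reads

x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1}),

so an escape analysis has to track the pair (x_k, x_{k-1}) rather than a single point, which is what makes the existing proof technique hard to apply directly.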


Cited by 13 publications (7 citation statements). References 6 publications.
“…1.2.0.1 Heavy ball: The convergence of deterministic HB, i.e., HB with exact gradient, has been thoroughly studied by [11], [14], [15], [16], [17] in both convex and nonconvex cases. An interesting finding is that HB can escape saddle points in nonconvex optimization by using a larger learning rate than GD [18]. HB momentum has also been successfully integrated into SGD to improve the training of DNNs.…”
Section: Additional Related Work
confidence: 99%
“…In the nonconvex community, the inertial technique [17] (also called heavy-ball or momentum) is widely used and has proved algorithmically efficient [18,19,20,21]. Besides acceleration and good practical performance on nonconvex problems, the advantage of the inertial technique is illustrated by the weaker conditions under which it avoids saddle points [22]. The procedure of the inertial method is quite simple: it uses a linear combination of the current and previous points for the next iteration.…”
Section: Inertial Methods
confidence: 99%
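As a concrete illustration of the inertial iteration just quoted, the sketch below runs a heavy-ball update on f(x, y) = x^2 - y^2, which has a strict saddle at the origin. The test function, step size, and momentum value are illustrative assumptions, not the setting analyzed in the cited papers.

import numpy as np

def grad_f(z):
    # Gradient of the toy saddle function f(x, y) = x^2 - y^2.
    return np.array([2.0 * z[0], -2.0 * z[1]])

def heavy_ball(z0, alpha=0.1, beta=0.5, max_iters=500):
    # Heavy-ball iteration: z_{k+1} = z_k - alpha * grad_f(z_k) + beta * (z_k - z_{k-1}).
    z_prev = z0.copy()  # previous iterate; starting with zero momentum
    z = z0.copy()       # current iterate
    for k in range(max_iters):
        z_next = z - alpha * grad_f(z) + beta * (z - z_prev)
        z_prev, z = z, z_next
        if np.linalg.norm(z) > 1.0:  # iterate has left the saddle's neighborhood
            return z, k
    return z, max_iters

rng = np.random.default_rng(0)
z0 = rng.normal(scale=1e-3, size=2)  # random initialization near the saddle
z, k = heavy_ball(z0)
print(f"escaped after {k} iterations at {z}")

On this toy problem the momentum term enlarges the per-step growth factor along the unstable y-direction (roughly 1.32 with the values above, versus 1 + 2*alpha = 1.2 for plain gradient descent), consistent with the quoted observation that the inertial term helps iterates leave saddle points.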
“…optimal oracle complexity result for (1.1) when F is smooth. The work [58] studies how the heavy-ball technique can help SGM escape saddle points. Distributed/parallel stochastic methods with delayed (sub)gradient information.…”
Section: Methods
confidence: 99%