2018
DOI: 10.48550/arxiv.1810.06801
Preprint

Quasi-hyperbolic momentum and Adam for deep learning

Abstract: Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our alg…
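
The update rule the abstract alludes to is compact enough to state directly. Below is a minimal NumPy sketch of that rule; the function name, learning rate, and toy usage are illustrative, while the two recurrences and the defaults ν = 0.7, β = 0.999 follow the paper's description and its recommended starting values.

# Minimal sketch of the QHM update: the parameter step is a nu-weighted
# average of a plain SGD step and a momentum step (names are illustrative).
import numpy as np

def qhm_step(theta, momentum_buf, grad, lr=0.1, beta=0.999, nu=0.7):
    # Normalized exponential moving average of gradients (momentum buffer):
    #   g_{t+1} = beta * g_t + (1 - beta) * grad_t
    momentum_buf = beta * momentum_buf + (1 - beta) * grad
    # QHM step: average the plain SGD direction with the momentum direction.
    #   theta_{t+1} = theta_t - lr * ((1 - nu) * grad_t + nu * g_{t+1})
    # nu = 0 recovers plain SGD; nu = 1 recovers momentum SGD (with a
    # normalized buffer).
    theta = theta - lr * ((1 - nu) * grad + nu * momentum_buf)
    return theta, momentum_buf

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
buf = np.zeros_like(theta)
for _ in range(100):
    theta, buf = qhm_step(theta, buf, grad=theta)
print(theta)  # moves toward the minimizer at the origin

QHAdam (also proposed in the paper) applies the same quasi-hyperbolic weighting idea within Adam's update.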


Cited by 19 publications (29 citation statements)
References 22 publications
“…Indeed, along with HB and NAG, the QHM method can also be seen as a numerical integrator on GM-ODE. QHM was shown to be very competitive in deep learning tasks (Choi et al., 2019) as well as in the strongly-convex setting (see Appendix J of Ma and Yarats, 2018). However, to the best of our knowledge, QHM has only been studied in the quadratic case (Gitman et al., 2019) (hence the novelty of our rate).…”
Section: Summary of the Results (mentioning)
confidence: 99%
“…3.2 in Khalil and Grizzle (2002)). The model above is inspired by the quasi-hyperbolic momentum (QHM) algorithm developed in Ma and Yarats (2018). We discuss the connection to QHM later in Sec.…”
Section: Continuous-time Analysis (mentioning)
confidence: 99%
“…The most basic improvements of gradient descent are momentum and Nesterov acceleration. There is a large body of current research either analyzing or suggesting modifications to (non-adaptive) momentum-based methods (Wibisono and Wilson, 2015; Wibisono et al., 2016; Yuan et al., 2016; Jin et al., 2017; Lucas et al., 2018; Ma and Yarats, 2018; Cyrus et al., 2018; Srinivasan et al., 2018; Kovachki and Stuart, 2019; Chen and Kyrillidis, 2019; Gitman et al., 2019).…”
Section: Discussion, Context and Recommendations (mentioning)
confidence: 99%