2019
DOI: 10.48550/arxiv.1909.13371
Preprint

Gradient Descent: The Ultimate Optimizer

Cited by 6 publications (10 citation statements)
References 6 publications

“…It is also necessary to tune the learning rate in methods that use an approximation of curvature information (such as quasi-Newton methods like BFGS/L-BFGS, and methods using a diagonal Hessian approximation like AdaHessian). The studies [22,42,8,1,26] have tackled the issue of tuning the learning rate and have developed methodologies with an adaptive learning rate, η_k, for first-order methods. Specifically, the work [26] finds the learning rate by approximating the Lipschitz smoothness parameter in an affordable way, without adding a tunable hyperparameter, for GD-type methods (with identity norm).…”
Section: Related Work
confidence: 99%
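
The step-size rule of [26] is only summarized in the quote above. As a rough, generic illustration of setting the learning rate from an estimated Lipschitz smoothness constant, the sketch below uses a secant estimate L_k ≈ ‖g_k − g_{k−1}‖ / ‖x_k − x_{k−1}‖ and the classical step size η_k = 1/L_k; the function names, constants, and test problem are illustrative assumptions, not taken from [26].

import numpy as np

def lipschitz_adaptive_gd(grad_f, x0, num_steps=100, eps=1e-12):
    # Gradient descent whose step size comes from a local secant estimate of
    # the Lipschitz smoothness constant L: eta_k = 1 / L_k with
    # L_k ~ ||g_k - g_{k-1}|| / ||x_k - x_{k-1}||.
    # A generic sketch, not the exact procedure of [26].
    x_prev = x0
    g_prev = grad_f(x_prev)
    x = x_prev - 1e-3 * g_prev            # small bootstrap step for the first iterate
    for _ in range(num_steps):
        g = grad_f(x)
        L_est = np.linalg.norm(g - g_prev) / (np.linalg.norm(x - x_prev) + eps)
        eta = 1.0 / (L_est + eps)         # classical 1/L step size for an L-smooth f
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

# Example: minimize the quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])
x_min = lipschitz_adaptive_gd(lambda x: A @ x, np.array([5.0, -3.0]))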
“…Before we proceed, we make a few more comments about the Hessian diagonal D_k in (8). As is clear from (8), a decaying exponential average of the Hessian diagonal is used, which can be very useful in noisy settings for smoothing out the Hessian noise over iterations.…”
Section: Deterministic OASIS
confidence: 99%
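
The quoted D_k is a decaying exponential average of a Hessian-diagonal estimate. Below is a minimal PyTorch sketch of that general pattern, assuming the diagonal is estimated with a Hutchinson-style z ⊙ (Hz) estimator as in AdaHessian-type methods; equation (8) of the cited paper is not reproduced here, and the beta value is an illustrative assumption.

import torch

def hutchinson_hessian_diag(loss, params, n_samples=1):
    # Hutchinson-style estimate of the Hessian diagonal, E[z * (H z)] with
    # Rademacher z, computed via Hessian-vector products (double backward).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # entries in {-1, +1}
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        diag = [d + (z * hv) / n_samples for d, z, hv in zip(diag, zs, hvps)]
    return diag

def ema_hessian_diag(D_prev, diag_new, beta=0.95):
    # Decaying exponential average D_k = beta * D_{k-1} + (1 - beta) * diag_k,
    # which smooths the per-iteration Hessian noise over iterations.
    return [beta * d_old + (1.0 - beta) * d_new
            for d_old, d_new in zip(D_prev, diag_new)]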
“…Norm-adapted descent can also be seen as a gradient-based algorithm which adjusts its learning rate at every step. Other works which adjust hyperparameters in the course of training include [7,2]. The key difference between our work and these approaches is that our learning rate adjustment is made based on the Newton-Raphson estimate, rather than local curvature information or the gradient of a learning step with respect to a hyperparameter.…”
Section: MNIST
confidence: 99%
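
The hyperparameter-adjusting works contrasted in the quote follow the hypergradient idea: differentiate the loss after one optimizer step with respect to the learning rate and descend on it. Below is a minimal sketch for plain gradient descent, assuming the standard update θ_t = θ_{t−1} − α g_{t−1}, so that ∂f(θ_t)/∂α = −g_t · g_{t−1}; the step sizes and test problem are illustrative, not the cited papers' exact code.

import numpy as np

def hypergradient_gd(grad_f, theta, alpha=0.01, beta=1e-4, num_steps=100):
    # Gradient descent whose learning rate alpha is itself updated by gradient
    # descent on the loss, using the hypergradient
    # d f(theta_t) / d alpha = -grad_f(theta_t) . grad_f(theta_{t-1}).
    g_prev = grad_f(theta)
    theta = theta - alpha * g_prev
    for _ in range(num_steps):
        g = grad_f(theta)
        alpha = alpha + beta * float(g @ g_prev)  # grow alpha when successive gradients align
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha

# Example on a simple quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])
theta_end, alpha_end = hypergradient_gd(lambda x: A @ x, np.array([5.0, -3.0]))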
“…To see that (2) is the appropriate Newton-Raphson update, consider the first-order approximation of the effect of the update step (1) on the value of f. From the fact that ∂/∂η f…”
Section: Introduction
confidence: 99%
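
The quoted sentence is cut off after "∂/∂η f". Assuming the update step (1) has the standard form w' = w − η ∇f(w), the chain-rule fact it appeals to is presumably of the following form (a reconstruction under that assumption, not the paper's exact statement):

\[
  \frac{\partial}{\partial \eta} f\bigl(w - \eta \nabla f(w)\bigr)
  = -\nabla f\bigl(w - \eta \nabla f(w)\bigr)^{\top} \nabla f(w)
  \approx -\|\nabla f(w)\|^{2} \quad \text{to first order in } \eta,
\]

so a small increase in η decreases f at a rate governed by ‖∇f(w)‖².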