2019
DOI: 10.48550/arxiv.1909.13371
Preprint

Gradient Descent: The Ultimate Optimizer

Cited by 6 publications (10 citation statements)
References 6 publications

“…It is also necessary to tune the learning rate in methods that use an approximation of curvature information (such as quasi-Newton methods like BFGS/L-BFGS, and methods using a diagonal Hessian approximation like AdaHessian). The studies [22,42,8,1,26] have tackled the issue of tuning the learning rate and have developed methodologies with an adaptive learning rate, η_k, for first-order methods. Specifically, the work [26] finds the learning rate by approximating the Lipschitz smoothness parameter in an affordable way, without adding a tunable hyperparameter, for GD-type methods (with identity norm).…”
Section: Related Work
confidence: 99%
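
The step-size rule of [26] is only summarized in the quote above. As a rough, generic illustration of setting the learning rate from an estimated Lipschitz smoothness constant, the sketch below uses a secant estimate L_k ≈ ‖g_k − g_{k−1}‖ / ‖x_k − x_{k−1}‖ and the classical step size η_k = 1/L_k; the function names, constants, and test problem are illustrative assumptions, not taken from [26].

import numpy as np

def lipschitz_adaptive_gd(grad_f, x0, num_steps=100, eps=1e-12):
    # Gradient descent whose step size comes from a local secant estimate of
    # the Lipschitz smoothness constant L: eta_k = 1 / L_k with
    # L_k ~ ||g_k - g_{k-1}|| / ||x_k - x_{k-1}||.
    # A generic sketch, not the exact procedure of [26].
    x_prev = x0
    g_prev = grad_f(x_prev)
    x = x_prev - 1e-3 * g_prev            # small bootstrap step for the first iterate
    for _ in range(num_steps):
        g = grad_f(x)
        L_est = np.linalg.norm(g - g_prev) / (np.linalg.norm(x - x_prev) + eps)
        eta = 1.0 / (L_est + eps)         # classical 1/L step size for an L-smooth f
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

# Example: minimize the quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])
x_min = lipschitz_adaptive_gd(lambda x: A @ x, np.array([5.0, -3.0]))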
“…Before we proceed, we make a few more comments about the Hessian diagonal D_k in (8). As is clear from (8), a decaying exponential average of the Hessian diagonal is used, which can be very useful in noisy settings for smoothing out the Hessian noise over iterations.…”
Section: Deterministic OASIS
confidence: 99%
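
The quoted D_k is a decaying exponential average of a Hessian-diagonal estimate. Below is a minimal PyTorch sketch of that general pattern, assuming the diagonal is estimated with a Hutchinson-style z ⊙ (Hz) estimator as in AdaHessian-type methods; equation (8) of the cited paper is not reproduced here, and the beta value is an illustrative assumption.

import torch

def hutchinson_hessian_diag(loss, params, n_samples=1):
    # Hutchinson-style estimate of the Hessian diagonal, E[z * (H z)] with
    # Rademacher z, computed via Hessian-vector products (double backward).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # entries in {-1, +1}
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        diag = [d + (z * hv) / n_samples for d, z, hv in zip(diag, zs, hvps)]
    return diag

def ema_hessian_diag(D_prev, diag_new, beta=0.95):
    # Decaying exponential average D_k = beta * D_{k-1} + (1 - beta) * diag_k,
    # which smooths the per-iteration Hessian noise over iterations.
    return [beta * d_old + (1.0 - beta) * d_new
            for d_old, d_new in zip(D_prev, diag_new)]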
“…Norm-adapted descent can also be seen as a gradient-based algorithm which adjusts its learning rate at every step. Other works which adjust hyperparameters in the course of training include [7,2]. The key difference between our work and these approaches is that our learning rate adjustment is made based on the Newton-Raphson estimate, rather than local curvature information or the gradient of a learning step with respect to a hyperparameter.…”
Section: MNIST
confidence: 99%
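
The hyperparameter-adjusting works contrasted in the quote follow the hypergradient idea: differentiate the loss after one optimizer step with respect to the learning rate and descend on it. Below is a minimal sketch for plain gradient descent, assuming the standard update θ_t = θ_{t−1} − α g_{t−1}, so that ∂f(θ_t)/∂α = −g_t · g_{t−1}; the step sizes and test problem are illustrative, not the cited papers' exact code.

import numpy as np

def hypergradient_gd(grad_f, theta, alpha=0.01, beta=1e-4, num_steps=100):
    # Gradient descent whose learning rate alpha is itself updated by gradient
    # descent on the loss, using the hypergradient
    # d f(theta_t) / d alpha = -grad_f(theta_t) . grad_f(theta_{t-1}).
    g_prev = grad_f(theta)
    theta = theta - alpha * g_prev
    for _ in range(num_steps):
        g = grad_f(theta)
        alpha = alpha + beta * float(g @ g_prev)  # grow alpha when successive gradients align
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha

# Example on a simple quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])
theta_end, alpha_end = hypergradient_gd(lambda x: A @ x, np.array([5.0, -3.0]))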
“…To see that (2) is the appropriate Newton-Raphson update, consider the first-order approximation of the effect of the update step (1) on the value of f. From the fact that ∂/∂η f…”
Section: Introduction
confidence: 99%
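
The quoted sentence is cut off after "∂/∂η f". Assuming the update step (1) has the standard form w' = w − η ∇f(w), the chain-rule fact it appeals to is presumably of the following form (a reconstruction under that assumption, not the paper's exact statement):

\[
  \frac{\partial}{\partial \eta} f\bigl(w - \eta \nabla f(w)\bigr)
  = -\nabla f\bigl(w - \eta \nabla f(w)\bigr)^{\top} \nabla f(w)
  \approx -\|\nabla f(w)\|^{2} \quad \text{to first order in } \eta,
\]

so a small increase in η decreases f at a rate governed by ‖∇f(w)‖².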