2020
DOI: 10.1007/s00245-020-09718-8
Backtracking Gradient Descent Method and Some Applications in Large Scale Optimisation. Part 2: Algorithms and Experiments

Abstract: In this paper, we provide new results and algorithms (including backtracking versions of Nesterov accelerated gradient and Momentum) which are more applicable to large scale optimisation as in Deep Neural Networks. We also demonstrate that Backtracking Gradient Descent (Backtracking GD) can obtain good upper bound estimates for local Lipschitz constants for the gradient, and that the convergence rate of Backtracking GD is similar to that in classical work of Armijo. Experiments with datasets CIFAR10 and CIFAR1…
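The core idea the abstract refers to, gradient descent with an Armijo-style backtracking line search, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, parameter names, and default values (`delta0`, `alpha`, `beta`) are assumptions chosen for clarity.

```python
import numpy as np

def backtracking_gd(f, grad, x0, delta0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iter=1000):
    """Gradient descent with Armijo backtracking line search (sketch).

    At each step the learning rate delta is shrunk by the factor beta
    until the Armijo sufficient-decrease condition holds:
        f(x - delta*g) - f(x) <= -alpha * delta * ||g||^2.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break  # (approximate) critical point reached
        delta = delta0
        # Backtrack: halve (beta=0.5) the step until sufficient decrease.
        while f(x - delta * g) - f(x) > -alpha * delta * g.dot(g):
            delta *= beta
    x = x - delta * g
    return x

# Example: minimise the quadratic f(x) = ||x||^2 / 2, whose gradient is x.
x_min = backtracking_gd(lambda x: 0.5 * x.dot(x), lambda x: x,
                        np.array([3.0, -4.0]))
```

Note that the accepted step size `delta` adapts per iteration; as the abstract notes, the accepted values can also serve as upper bound estimates for local Lipschitz constants of the gradient.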

Cited by 21 publications (35 citation statements). References 27 publications.
“…In case f has compact sublevels, then this is easily proven [9]. For the general case, see [10] for a proof.…”
Section: Convergence Results
confidence: 99%
“…There are many popular modifications trying to overcome this, such as Adam, Adadelta, Nesterov Accelerated Gradient, Momentum and so on (see [14] for a review); none of these are guaranteed to converge in general either. To date, only Backtracking GD is guaranteed to converge: see Chapter 12 in [9], in particular Proposition 12.6.1 there, for the case where f ∈ C^{1,1}_L, has compact sublevels, and has at most countably many critical points; see [8] when f is real analytic (or more generally satisfies the so-called Łojasiewicz gradient inequality); and see [10] for the general case of f being C^1 only with at most countably many critical points. Note that the assumption in the last paper is not too restrictive: indeed, it is known from transversality results that such an assumption is satisfied by a generic C^1 function (for example, by Morse functions, which are a well-known class of functions in geometry and analysis).…”
Section: Convergence Results
confidence: 99%