2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7966082

Nesterov's accelerated gradient and momentum as approximations to regularised update descent

Abstract: We present a unifying framework for adapting the update direction in gradient-based iterative optimization methods. As natural special cases we re-derive classical momentum and Nesterov's accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.
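
For orientation, the two special cases re-derived in the paper can be written in standard generic notation (the paper's own regularised-update-descent notation may differ). Both update a velocity and then the parameters; the only difference is the point at which the gradient is evaluated.

```latex
% Generic notation: learning rate \eta, momentum coefficient \mu,
% objective f with parameters \theta_t and velocity v_t.

% Classical momentum: gradient taken at the current parameters.
v_{t+1} = \mu v_t - \eta \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}

% Nesterov's accelerated gradient: gradient taken at the look-ahead
% point \theta_t + \mu v_t.
v_{t+1} = \mu v_t - \eta \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
```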

Cited by 96 publications (48 citation statements); references 3 publications. The citation statements below are ordered by relevance.

“…Dropout regularization (Srivastava et al., 2014) with a dropout ratio of 0.5 is applied to outputs of the first fully connected layer. The model is trained by optimizing the multinomial logistic regression objective using stochastic gradient descent (SGD) (LeCun, Bengio & Hinton, 2015) and Nesterov’s momentum (Botev, Lever & Barber, 2017). The customized model is optimized for hyper-parameters by a randomized grid search method (Bergstra & Bengio, 2012).…”
Section: Methods
Confidence: 99%
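
As an illustration of the training setup described in the excerpt above (not the cited paper's actual code), the combination of dropout after the first fully connected layer, a multinomial logistic regression objective, and SGD with Nesterov momentum might be sketched in PyTorch as follows; the architecture, layer sizes, and learning rate are placeholder assumptions.

```python
# Illustrative sketch only: dropout (p=0.5) after the first fully connected
# layer, a cross-entropy (multinomial logistic regression) objective, and
# SGD with Nesterov momentum. All sizes and the learning rate are placeholders.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):  # hypothetical model, not the cited architecture
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc1 = nn.Linear(32 * 16 * 16, 256)  # first fully connected layer
        self.drop = nn.Dropout(p=0.5)            # dropout ratio of 0.5
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)

model = SmallConvNet()
criterion = nn.CrossEntropyLoss()  # multinomial logistic regression objective
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, nesterov=True)
```

The remaining hyper-parameters (learning rate, momentum, layer sizes) would then be tuned, for example by the randomized search the excerpt mentions.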
“…Each Training-ValueNet is a MLP regression network with a single hidden layer of 1024 units. Training is carried out using minibatch SGD with a batch size of 32 and 0.9 Nesterov momentum [25]. We also use dropout [22] after the hidden layer at a rate of 0.7.…”
Section: Monte-Carlo Estimation
Confidence: 99%
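
The regression setup in the excerpt above can be sketched similarly (illustrative only; the input dimensionality, learning rate, and loss function are assumptions not stated in the excerpt).

```python
# Illustrative sketch: an MLP regression network with a single 1024-unit
# hidden layer, dropout at rate 0.7 after the hidden layer, and minibatch
# SGD (batch size 32) with Nesterov momentum 0.9.
import torch
import torch.nn as nn

in_dim = 512  # placeholder; the input dimension is not given in the excerpt
model = nn.Sequential(
    nn.Linear(in_dim, 1024),  # single hidden layer of 1024 units
    nn.ReLU(),
    nn.Dropout(p=0.7),        # dropout after the hidden layer
    nn.Linear(1024, 1),       # scalar regression output
)
criterion = nn.MSELoss()      # assumed regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, nesterov=True)
batch_size = 32               # minibatch size from the excerpt
```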
“…The momentum term (Qian, 1999) of SGD helps in accelerating the process by allowing the SGD to navigate better in ravines. However, although the momentum term has proved extremely useful, there has been an improvement on it which is known as Nesterov Accelerated Gradient (NAG) (Botev et al, 2017). This allows the calculation of the gradient not based on the current parameters but based on the future position of the parameters.…”
Section: System Design
Confidence: 99%
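
The look-ahead idea in the last excerpt can be made concrete with a minimal sketch (generic NAG formulation, not the paper's regularised-update-descent derivation): relative to classical momentum, the only change is where the gradient is evaluated.

```python
# Minimal sketch contrasting classical momentum with Nesterov's accelerated
# gradient (NAG) on a toy quadratic objective f(theta) = 0.5 * ||theta||^2.
import numpy as np

def grad(theta):
    return theta  # gradient of 0.5 * ||theta||^2

def momentum_step(theta, v, lr=0.1, mu=0.9):
    v = mu * v - lr * grad(theta)            # gradient at the current parameters
    return theta + v, v

def nag_step(theta, v, lr=0.1, mu=0.9):
    v = mu * v - lr * grad(theta + mu * v)   # gradient at the look-ahead position
    return theta + v, v

theta, v = np.ones(2), np.zeros(2)
for _ in range(50):
    theta, v = nag_step(theta, v)
print(theta)  # approaches the minimum at the origin
```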