2020
DOI: 10.48550/arxiv.2006.02409
Preprint

On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs

Abstract: Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training deep neural networks. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to efficiently compute an approximate search direction, relies on the conjugate gradient …
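For intuition, here is a minimal sketch (not the authors' code) of one SGN-style step in JAX: a Gauss-Newton matrix-vector product built from one forward-mode and one reverse-mode product, handed to a few conjugate-gradient iterations. It assumes the mini-batch loss is a sum of squared residuals; residual_fn, params, and batch are hypothetical placeholders.

import jax
from jax.scipy.sparse.linalg import cg

def gauss_newton_mvp(residual_fn, params, batch, v):
    # (J^T J) v without forming J, where J is the Jacobian of the residuals w.r.t. params.
    r_fn = lambda p: residual_fn(p, batch)
    _, jv = jax.jvp(r_fn, (params,), (v,))          # forward mode: J v
    _, vjp_fn = jax.vjp(r_fn, params)
    (jtjv,) = vjp_fn(jv)                            # reverse mode: J^T (J v)
    return jtjv

def sgn_direction(residual_fn, params, batch, damping=1e-3, cg_iters=10):
    # Approximately solve (J^T J + damping * I) d = -g with a few CG iterations.
    r_fn = lambda p: residual_fn(p, batch)
    r, vjp_fn = jax.vjp(r_fn, params)
    (g,) = vjp_fn(r)                                # gradient of 0.5 * ||r||^2 w.r.t. params
    def damped_mvp(v):
        gv = gauss_newton_mvp(residual_fn, params, batch, v)
        return jax.tree_util.tree_map(lambda a, b: a + damping * b, gv, v)
    neg_g = jax.tree_util.tree_map(lambda x: -x, g)
    d, _ = cg(damped_mvp, neg_g, maxiter=cg_iters)
    return d

A parameter update would then be params + alpha * d for some step size alpha; the damping term plays the role of Levenberg-Marquardt regularization and keeps the CG system positive definite.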

Cited by 3 publications (5 citation statements) | References 4 publications

“…A possible reason is that when the residuals are big, doing more GN iterations may not lead to a better direction for minimizing (37). A similar observation has been made in [53] for training DNNs. It is experimentally shown that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from the mini-batch is not reliable due to non-representative batches and/or big residuals.…”
Section: Results of the Type II Model (mentioning)
confidence: 53%
“…It is experimentally shown that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from the mini-batch is not reliable due to non-representative batches and/or big residuals. On the other hand, if the residuals are small, a higher number of CG iterations can produce more accurate results thanks to the curvature information [53].…”
Section: Results of the Type II Model (mentioning)
confidence: 99%
“…To alleviate this issue, the update direction should also be computed taking into account second-order information. Second-order methods are notably more robust to the step-size selection than first-order methods, since their update includes information on the local curvature (Agarwal et al., 2019; Gargiani et al., 2020). Noise annealing strategies.…”
Section: Conclusion, Limitations and Future Work (mentioning)
confidence: 99%
“…The above approximations are inspired by Gauss-Newton (GN) methods for nonlinear least-squares problems (see, e.g., [34]), where the Hessian matrix of the objective function $\sum_{i=1}^{p} (r_i - a_i)^2$ (in which each $r_i$ is a scalar function and $a_i$ a scalar) is approximated by $\sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}$, and also from the fact that the empirical risk of misclassification in ML is often a sum of non-negative terms matching a function to a scalar, which can then be considered in a least-squares fashion [3,15]. The resulting approximate adjoint equation $(\nabla_y f\, \nabla_y f)\, \lambda = -\nabla_y f\, u$ is most likely infeasible, and we suggest solving it in the least-squares sense.…”
Section: Contributions of the Paper (mentioning)
confidence: 99%
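For context, the standard Gauss-Newton reasoning behind this approximation (a worked expansion, not quoted from the cited paper) is:

\nabla^2 \sum_{i=1}^{p} (r_i - a_i)^2
  \;=\; 2 \sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}
  \;+\; 2 \sum_{i=1}^{p} (r_i - a_i)\, \nabla^2 r_i
  \;\approx\; 2 \sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}.

The dropped second term is weighted by the residuals $r_i - a_i$ (the constant factor of 2 is typically absorbed into the scaling of the objective), which is consistent with the observation in the statements above that extra CG iterations on the Gauss-Newton system pay off mainly in the small-residual regime.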