Proceedings of the 2020 SIAM International Conference on Data Mining (SDM 2020)
DOI: 10.1137/1.9781611976236.22

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Abstract: While stochastic gradient descent (SGD) and variants have been surprisingly successful for training deep nets, several aspects of the optimization dynamics and generalization are still not well understood. In this paper, we present new empirical observations and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets based on the Hessian of the training loss and associated quantities. We consider three specific research questions: (1) what is the relationship betwee…
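As a rough illustration of the kind of Hessian-based quantity such an analysis tracks, the sketch below estimates the top eigenvalue (spectral norm) of the training-loss Hessian by power iteration on Hessian-vector products. This is a minimal PyTorch sketch, not the authors' code; the loss, parameter list, and iteration count are placeholders.

    import torch

    def top_hessian_eigenvalue(loss, params, iters=20):
        # First-order gradients, keeping the graph for a second backward pass.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Start power iteration from a random unit vector in parameter space.
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        eig = 0.0
        for _ in range(iters):
            # Hessian-vector product: gradient of <grad, v> w.r.t. the parameters.
            hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
            # Rayleigh quotient with the current unit vector v.
            eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
            norm = torch.sqrt(sum((h * h).sum() for h in hv))
            v = [h / norm for h in hv]
        return eig

    # Example usage (model, criterion, x, y are placeholders):
    # loss = criterion(model(x), y)
    # lam_max = top_hessian_eigenvalue(loss, list(model.parameters()))

Evaluating this at successive SGD iterates is one way to trace how the sharpest curvature direction evolves along the training trajectory.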

Cited by 18 publications (17 citation statements). References 43 publications.
“…Finally, Figure 6 suggests that exact line searches on the mini-batch loss perform poorly. Supporting results are obtained for SGD with momentum, for ResNet-18, and for MobileNetV2; see Appendix Figures 9, 15, 16, 17, and 18. However, in the case of SGD with momentum the line search is consistently not as exact.…”
Section: On the Behavior of Line Search Approaches on the Full-Batch …
confidence: 59%
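For context, an "exact" line search of the kind discussed in this excerpt can be approximated by scanning step sizes along the negative mini-batch gradient and keeping the one with the lowest mini-batch loss. The sketch below is illustrative only, not the cited paper's implementation; closure() is an assumed helper that recomputes the mini-batch loss at the current parameters.

    import torch

    def exact_line_search(model, closure, max_step=1.0, num_points=50):
        params = [p for p in model.parameters() if p.requires_grad]
        loss = closure()                       # mini-batch loss at the current point
        grads = torch.autograd.grad(loss, params)
        best_step, best_loss = 0.0, loss.item()
        backup = [p.detach().clone() for p in params]
        for step in torch.linspace(0.0, max_step, num_points):
            with torch.no_grad():
                # Move to theta - step * grad and evaluate the same mini-batch loss.
                for p, p0, g in zip(params, backup, grads):
                    p.copy_(p0 - step * g)
            cand = closure().item()
            if cand < best_loss:
                best_step, best_loss = step.item(), cand
        with torch.no_grad():
            # Restore the best point found along the search direction.
            for p, p0, g in zip(params, backup, grads):
                p.copy_(p0 - best_step * g)
        return best_step, best_loss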
“…SGD trajectories: Similar to this work, [29] analyzes the loss along SGD trajectories, but with less focus on line searches and the exact shape of the full-batch loss. [12] and [16] consider second-order information along SGD trajectories. [12] investigates the spectral norm of the Hessian (highest curvature) along the SGD trajectory and shows, inter alia, that it initially visits increasingly sharp regions.…”
Section: Related Work
confidence: 99%
“…The phenomenon that the gradients of deep models live on a very low dimensional manifold has been widely observed (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020; Martin & Mahoney, 2018; Li et al., 2018). People have also used this fact to compress the gradient with low-rank approximation in the distributed optimization scenario (Yurtsever et al., 2017; Wang et al., 2018b; Karimireddy et al., 2019; Vogels et al., 2019).…”
Section: Related Work
confidence: 95%
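A minimal way to check the low-rank observation mentioned in this excerpt is to stack flattened gradients from several training steps into a matrix and measure how much of their energy a rank-k subspace captures. The sketch below assumes such a gradient matrix G has already been collected; it is illustrative, not code from the cited works.

    import torch

    def gradient_energy_in_top_k(G, k):
        # G: (num_steps, num_params) matrix whose rows are flattened gradients.
        _, S, _ = torch.linalg.svd(G, full_matrices=False)
        # Fraction of squared Frobenius norm explained by the top-k singular directions.
        return (S[:k] ** 2).sum() / (S ** 2).sum()

A value close to 1 for small k indicates that the gradients effectively lie in a low-dimensional subspace, which is what makes low-rank gradient compression viable.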
“…The dimensional barrier is attributed to the fact that the added noise is isotropic while the gradients live on a very low dimensional manifold, which has been observed in (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020) and is also verified in Figure 2 for the gradients of a 20-layer ResNet (He et al., 2016). Hence, to limit the noise energy, it is natural to ask: "Can we reduce the dimension of gradients first and then add the isotropic noise onto a low-dimensional gradient embedding?"…”
Section: Introduction
confidence: 93%
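The question raised in this excerpt, reducing the gradient's dimension before adding isotropic noise, could look roughly like the sketch below: project the gradient onto an orthonormal basis of a k-dimensional subspace, add Gaussian noise in that embedding, and map back. The basis B, the shapes, and the noise scale are hypothetical placeholders, not the cited method.

    import torch

    def noisy_low_dim_gradient(g, B, sigma):
        # g: (d,) flattened gradient; B: (d, k) orthonormal basis; sigma: noise scale.
        z = B.T @ g                          # low-dimensional embedding of the gradient
        z = z + sigma * torch.randn_like(z)  # isotropic noise added only in the k dims
        return B @ z                         # map the noisy embedding back to full space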