Proceedings of the 2020 SIAM International Conference on Data Mining (SDM 2020)
DOI: 10.1137/1.9781611976236.22

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Abstract: While stochastic gradient descent (SGD) and variants have been surprisingly successful for training deep nets, several aspects of the optimization dynamics and generalization are still not well understood. In this paper, we present new empirical observations and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets based on the Hessian of the training loss and associated quantities. We consider three specific research questions: (1) what is the relationship betwee…
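As a rough illustration of the kind of Hessian-based quantity such an analysis tracks, the sketch below estimates the top eigenvalue (spectral norm) of the training-loss Hessian by power iteration on Hessian-vector products. This is a minimal PyTorch sketch, not the authors' code; the loss, parameter list, and iteration count are placeholders.

    import torch

    def top_hessian_eigenvalue(loss, params, iters=20):
        # First-order gradients, keeping the graph for a second backward pass.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Start power iteration from a random unit vector in parameter space.
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        eig = 0.0
        for _ in range(iters):
            # Hessian-vector product: gradient of <grad, v> w.r.t. the parameters.
            hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
            # Rayleigh quotient with the current unit vector v.
            eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
            norm = torch.sqrt(sum((h * h).sum() for h in hv))
            v = [h / norm for h in hv]
        return eig

    # Example usage (model, criterion, x, y are placeholders):
    # loss = criterion(model(x), y)
    # lam_max = top_hessian_eigenvalue(loss, list(model.parameters()))

Evaluating this at successive SGD iterates is one way to trace how the sharpest curvature direction evolves along the training trajectory.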

Cited by 18 publications (17 citation statements). References 43 publications.
“…Finally, Figure 6 suggests that exact line searches on the mini-batch loss perform poorly. Supporting results are obtained for SGD with momentum, for ResNet-18, and for MobileNetV2; see Appendix Figures 9, 15, 16, 17, and 18. However, in the case of SGD with momentum the line search is consistently not as exact.…”
Section: On the Behavior of Line Search Approaches on the Full-Batch …
confidence: 59%
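For context, an "exact" line search of the kind discussed in this excerpt can be approximated by scanning step sizes along the negative mini-batch gradient and keeping the one with the lowest mini-batch loss. The sketch below is illustrative only, not the cited paper's implementation; closure() is an assumed helper that recomputes the mini-batch loss at the current parameters.

    import torch

    def exact_line_search(model, closure, max_step=1.0, num_points=50):
        params = [p for p in model.parameters() if p.requires_grad]
        loss = closure()                       # mini-batch loss at the current point
        grads = torch.autograd.grad(loss, params)
        best_step, best_loss = 0.0, loss.item()
        backup = [p.detach().clone() for p in params]
        for step in torch.linspace(0.0, max_step, num_points):
            with torch.no_grad():
                # Move to theta - step * grad and evaluate the same mini-batch loss.
                for p, p0, g in zip(params, backup, grads):
                    p.copy_(p0 - step * g)
            cand = closure().item()
            if cand < best_loss:
                best_step, best_loss = step.item(), cand
        with torch.no_grad():
            # Restore the best point found along the search direction.
            for p, p0, g in zip(params, backup, grads):
                p.copy_(p0 - best_step * g)
        return best_step, best_loss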
“…SGD trajectories: Similar to this work, [29] analyzes the loss along SGD trajectories, but with less focus on line searches and the exact shape of the full-batch loss. [12] and [16] consider second-order information along SGD trajectories. [12] investigates the spectral norm of the Hessian (highest curvature) along the SGD trajectory and shows, inter alia, that it initially visits increasingly sharp regions.…”
Section: Related Work
confidence: 99%
“…The phenomenon that the gradients of deep models live on a very low dimensional manifold has been widely observed (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020; Martin & Mahoney, 2018; Li et al., 2018). People have also used this fact to compress the gradient with low-rank approximation in the distributed optimization scenario (Yurtsever et al., 2017; Wang et al., 2018b; Karimireddy et al., 2019; Vogels et al., 2019).…”
Section: Related Work
confidence: 95%
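A minimal way to check the low-rank observation mentioned in this excerpt is to stack flattened gradients from several training steps into a matrix and measure how much of their energy a rank-k subspace captures. The sketch below assumes such a gradient matrix G has already been collected; it is illustrative, not code from the cited works.

    import torch

    def gradient_energy_in_top_k(G, k):
        # G: (num_steps, num_params) matrix whose rows are flattened gradients.
        _, S, _ = torch.linalg.svd(G, full_matrices=False)
        # Fraction of squared Frobenius norm explained by the top-k singular directions.
        return (S[:k] ** 2).sum() / (S ** 2).sum()

A value close to 1 for small k indicates that the gradients effectively lie in a low-dimensional subspace, which is what makes low-rank gradient compression viable.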
“…The dimensional barrier is attributed to the fact that the added noise is isotropic while the gradients live on a very low dimensional manifold, which has been observed in (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020) and is also verified in Figure 2 for the gradients of a 20-layer ResNet (He et al., 2016). Hence, to limit the noise energy, it is natural to ask: "Can we reduce the dimension of gradients first and then add the isotropic noise onto a low-dimensional gradient embedding?"…”
Section: Introduction
confidence: 93%
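The question raised in this excerpt, reducing the gradient's dimension before adding isotropic noise, could look roughly like the sketch below: project the gradient onto an orthonormal basis of a k-dimensional subspace, add Gaussian noise in that embedding, and map back. The basis B, the shapes, and the noise scale are hypothetical placeholders, not the cited method.

    import torch

    def noisy_low_dim_gradient(g, B, sigma):
        # g: (d,) flattened gradient; B: (d, k) orthonormal basis; sigma: noise scale.
        z = B.T @ g                          # low-dimensional embedding of the gradient
        z = z + sigma * torch.randn_like(z)  # isotropic noise added only in the k dims
        return B @ z                         # map the noisy embedding back to full space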