2020
DOI: 10.48550/arxiv.2006.08173
Preprint

Neural gradients are near-lognormal: improved quantized and sparse training

Abstract: Neural gradient compression remains a main bottleneck in improving training efficiency, as most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. However, these methods are not suitable for compressing neural gradients, which have a very different distribution. Specifically, we find that the neural gradients follow a lognormal distribution. Taking this into account, we suggest two methods to reduce the computational and memory burden…
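
To make the abstract's central claim concrete, here is a brief, hypothetical sketch (not code from the paper) of how one might check whether neural gradients, i.e. the backpropagated gradients with respect to activations, are near-lognormal: collect them for a toy model on random data and test whether log|g| is close to Gaussian. The model, the random data, and the Kolmogorov–Smirnov test are assumptions chosen for brevity.

```python
# Illustrative sketch only (not the paper's code): test whether activation
# gradients from a toy two-layer network look near-lognormal.
import torch
import torch.nn as nn
from scipy import stats

torch.manual_seed(0)
lin1, lin2 = nn.Linear(128, 256), nn.Linear(256, 10)
x = torch.randn(512, 128)          # random inputs stand in for real data
y = torch.randint(0, 10, (512,))   # random labels

h = torch.relu(lin1(x))
h.retain_grad()                    # keep the activation ("neural") gradient after backward
loss = nn.CrossEntropyLoss()(lin2(h), y)
loss.backward()

g = h.grad.flatten()
log_abs_g = torch.log(g.abs()[g != 0]).numpy()   # drop exact zeros from dead ReLU units

# If gradients are near-lognormal, log|g| should be close to Gaussian.
mu, sigma = log_abs_g.mean(), log_abs_g.std()
ks_stat, _ = stats.kstest((log_abs_g - mu) / sigma, "norm")
print(f"log|g|: mean={mu:.3f}, std={sigma:.3f}, KS statistic={ks_stat:.3f}")
```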

Cited by 2 publications (4 citation statements)
References 17 publications

“…Our result shows that ResNets have log-Gaussian behaviour on initialization, and like fully connected networks [22], the behaviour is determined by the depth-to-width aspect ratio. This corroborates recent empirical observations about deep ResNets [25]. Since real world networks are finite, the question of how well this approximates finite behaviour is of paramount importance.…”
Section: Introduction (supporting)
confidence: 88%
“…The input-output derivative ∂_{x_i} z_out has the same type of behaviour as z_out itself; a simple proof is given in Appendix B. It is expected that the gradient with respect to the weights ∂_{W_ij} z_out will also have the same qualitative behaviour [25], although more investigation is needed to understand this theoretically. Exponentially large variance for gradients is a manifestation of the vanishing-and-exploding gradient problem [28].…”
Section: Vanishing and Exploding Norms (mentioning)
confidence: 96%
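
As a purely illustrative aside (a toy under assumed iid per-layer factors, not the cited analysis): if a backpropagated quantity is modelled as a product of independent random per-layer scale factors, its logarithm is a sum of logs, so it is approximately log-Gaussian with a variance that scales roughly like the depth-to-width ratio; the product itself then has moments that grow or shrink exponentially with depth, matching the exploding/vanishing picture described above.

```python
# Toy numerical sketch: log of a product of iid per-layer factors is a sum of
# logs, giving approximately log-Gaussian behaviour with variance ~ depth/width.
import numpy as np

rng = np.random.default_rng(0)
width = 256
for depth in (4, 16, 64):
    # iid per-layer scale factors with standard deviation 1/sqrt(width)
    factors = rng.normal(loc=1.0, scale=1.0 / np.sqrt(width), size=(10_000, depth))
    log_prod = np.log(np.abs(factors)).sum(axis=1)   # log of the per-layer product
    print(f"depth={depth:3d}  var(log product)={log_prod.var():.4f}  depth/width={depth / width:.4f}")
```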
“…Quantization techniques have demonstrated remarkable performance improvements in training and inference of DNN models [35][36][37][38][39]. Similarly, the advancements in half-precision and mixed-precision training [40,41] have played an important role in efficient DNN execution.…”
Section: Related Work (mentioning)
confidence: 99%
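
For context only, a minimal sketch of generic symmetric uniform quantization, the broad family of techniques the passage above refers to; it is not the specific method of the indexed paper or of any cited work, and the bit-width and scaling choices below are arbitrary assumptions.

```python
# Minimal sketch: symmetric uniform (fake) quantization of a tensor.
import torch

def quantize_symmetric(t: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Round t onto a uniform signed grid with 2**(bits-1)-1 positive levels, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-12) / qmax   # per-tensor scale from the max magnitude
    return torch.round(t / scale).clamp(-qmax, qmax) * scale

x = torch.randn(4, 4)
print((x - quantize_symmetric(x, bits=4)).abs().max())  # worst-case quantization error
```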