Neural gradient compression remains a major bottleneck in improving training efficiency, as most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. However, these methods are not suitable for compressing neural gradients, which have a very different distribution. Specifically, we find that neural gradients follow a lognormal distribution. Taking this into account, we suggest two methods to reduce the computational and memory burdens of neural gradients. The first is stochastic gradient pruning, which can accurately set the sparsity level: up to 85% gradient sparsity without hurting validation accuracy (ResNet18 on ImageNet). The second method determines the floating-point format for low-numerical-precision gradients (e.g., FP8). Our results shed light on previous findings related to local scaling, the optimal bit allocation for the mantissa and exponent, and challenging workloads for which low-precision floating-point arithmetic has been reported to fail. A reference implementation accompanies the paper.
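To illustrate the pruning idea, the sketch below (Python/PyTorch, not the authors' reference implementation) picks a pruning threshold from a lognormal fit of the gradient magnitudes and then prunes stochastically so that each gradient is preserved in expectation; the quantile-based threshold choice, the function names, and the rescaling rule are illustrative assumptions rather than the paper's exact procedure.

import torch

def lognormal_threshold(grad, target_sparsity):
    # Fit log|g| with a normal distribution (zeros ignored) and take the
    # target-sparsity quantile of the implied lognormal as the pruning threshold.
    mag = grad.abs()
    logs = torch.log(mag[mag > 0])
    mu, sigma = logs.mean(), logs.std()
    q = torch.erfinv(torch.tensor(2.0 * target_sparsity - 1.0, device=grad.device))
    return torch.exp(mu + sigma * q * (2.0 ** 0.5))

def stochastic_prune(grad, threshold):
    # Unbiased pruning: a gradient with |g| < T is kept at magnitude T with
    # probability |g| / T and zeroed otherwise, so the expectation stays equal to g.
    mag = grad.abs()
    keep_prob = (mag / threshold).clamp(max=1.0)
    keep = torch.rand_like(grad) < keep_prob
    boosted = torch.where(mag < threshold, torch.sign(grad) * threshold, grad)
    return torch.where(keep, boosted, torch.zeros_like(grad))

# Example: prune stand-in neural gradients toward ~85% sparsity.
g = torch.randn(4, 64, 32, 32) * torch.rand(4, 64, 32, 32)
g_sparse = stochastic_prune(g, lognormal_threshold(g, target_sparsity=0.85))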
Convolutional neural networks (CNNs) achieve state-of-the-art results for various tasks at the price of high computational demands. Inspired by the observation that spatial correlation exists in CNN output feature maps (ofms), we propose a method to dynamically predict whether ofm activations are zero-valued or not according to their neighboring activation values, thereby skipping the computation of zero-valued activations and reducing the number of convolution operations. We implement the zero activation predictor (ZAP) with a lightweight CNN, which imposes negligible overheads and is easy to train and deploy on existing models. Furthermore, without model retraining, the same ZAP can be tuned to many different operating points along the accuracy-savings trade-off curve. For example, using VGG-16 and the ILSVRC-2012 dataset, two different operating points achieve a reduction of 20% and 30% of multiply-accumulate (MAC) operations with top-1/top-5 accuracy degradation of 0.1%/0.04% and 1.3%/0.7%, respectively, without fine-tuning of the entire model. With one epoch of fine-tuning, 45% of MAC operations may be reduced with 1.3%/0.7% accuracy degradation.
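As a rough emulation of this idea (not the paper's ZAP architecture or operating points), the sketch below computes the ofm on a strided grid, uses a single cheap convolution over those neighboring activations to predict which remaining positions will be zeroed by ReLU, and masks them out; the 3x3 predictor, the stride-2 grid, and the threshold tau are illustrative assumptions, and the masking only emulates, rather than realizes, the MAC savings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ZAP(nn.Module):
    # Lightweight predictor: one 3x3 conv over a coarsely computed ofm, producing a
    # per-position probability that the full-resolution activation is zero after ReLU.
    def __init__(self, channels, tau=0.5):
        super().__init__()
        self.predict = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.tau = tau  # operating point: a higher tau skips more positions

    def forward(self, coarse_ofm):
        p_zero = torch.sigmoid(self.predict(coarse_ofm))
        p_zero = F.interpolate(p_zero, scale_factor=2, mode="nearest")
        return p_zero > self.tau  # boolean mask of positions predicted to be zero

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
zap = ZAP(channels=128, tau=0.5)
x = torch.randn(1, 64, 56, 56)

# Cheap "neighboring" activations: the same convolution evaluated on a stride-2 grid.
coarse = F.relu(F.conv2d(x, conv.weight, conv.bias, stride=2, padding=1))
skip = zap(coarse)                       # (1, 1, 56, 56) mask of predicted-zero positions
ofm = F.relu(conv(x)) * (~skip)          # emulate skipping the predicted-zero MACs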