Towards Unified INT8 Training for Convolutional Neural Network

Zhu, Feng; Gong, Ruihao; Yu, Fei; Liu, Xianglong; Wang, Yanfei; Li, Zhelong; Yang, Xiuqi; Yan, Junjie

doi:10.1109/cvpr42600.2020.00204

Cited by 124 publications

(79 citation statements)

References 25 publications


“…However, the SOTA neural network models suffer massive parameters and large sizes to achieve good performance in different tasks, which also cause significant complex computation and great resource consumption. To compress and accelerate the deep CNNs, many approaches have been proposed, which can be classified into five categories: transferred/compact convolutional filters [89,85,78]; quantization/binarization [35,11,82,92]; knowledge distillation [12,86,16]; pruning [28,31,22]; low-rank factorization [46,38,47,79].…”

Section: Related Work (mentioning)

confidence: 99%

Forward and Backward Information Retention for Accurate Binary Neural Networks

Qin

Gong

Liu

et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite


Model binarization is an effective method of compressing neural networks and accelerating their inference process, which enables state-of-the-art models to run on resource-limited devices. Recently, advanced binarization methods have been greatly improved by minimizing the quantization error directly in the forward process. However, a significant performance gap still exists between the 1-bit model and the 32-bit one. The empirical study shows that binarization causes a great loss of information in the forward and backward propagation, which harms the performance of binary neural networks (BNNs), and the limited information representation ability of binarized parameters is one of the bottlenecks of BNN performance. We present a novel Distribution-sensitive Information Retention Network (DIR-Net) to retain the information of the forward activations and backward gradients, which improves BNNs


“…It yields more efficient inference since all the quantization parameters are known in advance and fixed. Several of the most common range estimators include: current min-max, or simply min-max, which uses the full dynamic range of the tensor (Zhou et al., 2016; Wu et al., 2018b; Zhu et al., 2020);…”

Section: Background and Related Work (mentioning)

confidence: 99%
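The min-max range estimator named in the statement above can be illustrated with a minimal sketch. It assumes asymmetric, per-tensor, unsigned 8-bit uniform quantization; the function names (`minmax_qparams`, `quantize`, `dequantize`) are hypothetical and are not taken from any of the cited papers.

```python
def minmax_qparams(values, num_bits=8):
    """Estimate (scale, zero_point) from the full dynamic range of `values`."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable
    # (an assumption of this sketch, common in asymmetric schemes).
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to integers in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    return [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate real values."""
    return [(q - zero_point) * scale for q in q_values]


tensor = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = minmax_qparams(tensor)
q_tensor = quantize(tensor, scale, zero_point)
restored = dequantize(q_tensor, scale, zero_point)
```

Because the estimator uses the full range of the tensor, the smallest and largest values map to the ends of the integer grid; a single outlier therefore stretches the scale for every other value, which is the usual motivation for the alternative estimators such surveys go on to list.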

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Bondarenko

Nagel

Blankevoort

2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing


Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme: per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss.


“…Due to the resource-constrained environment of the edge devices, a high-performance on-device learning system needs to reduce the resource cost as much as possible, including the hardware-level computation overhead, communication-level network traffic and energy-level consumption. Overall, resource saving is one of the most essential demands to deploy on-device learning [49]. Personalized Model.…”

Section: B On-device Learning (mentioning)

confidence: 99%

“…Therefore, the gradients of activation and weights of current layer are also calculated as INT8. After obtaining the full gradients of this layer, we need to dequantize them into FP32 format and update the model parameters in full precision for better optimization accuracy [49]. Therefore, a full-precision copy of the original weights and activations is needed for conducting the model updating in the FP32 data type.…”

Section: B Low-precision Data Representation (mentioning)

confidence: 99%
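The update flow described in the statement above (INT8 gradients, dequantized and applied to a full-precision weight copy) can be sketched as follows. This is an illustrative assumption, not the procedure of the cited paper: it uses symmetric per-tensor quantization and plain SGD, Python floats stand in for FP32, and the names `quantize_sym_int8` and `sgd_step` are hypothetical.

```python
def quantize_sym_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Use a scale of 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q_values = [max(-127, min(127, round(v / scale))) for v in values]
    return q_values, scale


def dequantize(q_values, scale):
    """Recover approximate real-valued gradients from INT8 codes."""
    return [q * scale for q in q_values]


def sgd_step(master_weights, grad_q, grad_scale, lr=0.1):
    """Apply an INT8 gradient to the full-precision weight copy."""
    grad = dequantize(grad_q, grad_scale)
    return [w - lr * g for w, g in zip(master_weights, grad)]


master_weights = [0.5, -0.25]      # full-precision copy kept for the update
true_grad = [0.2, -0.05]           # gradient before quantization
grad_q, grad_scale = quantize_sym_int8(true_grad)
new_weights = sgd_step(master_weights, grad_q, grad_scale)
```

Keeping the weight copy in full precision means that small updates, which would round to zero on an 8-bit grid, still accumulate across steps.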

On-Device Learning Systems for Edge Intelligence: A Software and Hardware Synergy Perspective

Zhou

Guo

et al. 2021

IEEE Internet Things J.


Modern machine learning (ML) applications are often deployed in the cloud environment to exploit the computational power of clusters. However, this in-cloud computing scheme cannot satisfy the demands of emerging edge intelligence scenarios, including providing personalized models, protecting user privacy, adapting to real-time tasks and saving resource cost. To overcome the limitations of conventional in-cloud computing, on-device learning has emerged, which runs the end-to-end ML procedure entirely on user devices, without unnecessary involvement of the cloud. Despite the promising advantages of on-device learning, implementing a high-performance on-device learning system still faces many severe challenges, such as insufficient user training data, backward propagation blocking and limited peak processing speed. Observing the substantial room for improvement in the implementation and acceleration of on-device learning systems, we intend to present a comprehensive analysis of the latest research progress and point out potential optimization directions from the system perspective. This survey presents a software and hardware synergy of on-device learning techniques, covering the scope of model-level neural network design, algorithm-level training optimization and hardware-level instruction acceleration. We hope this survey could bring fruitful discussions and inspire the researchers to further promote the field of edge intelligence.

