2020
DOI: 10.1007/978-3-030-58574-7_27

HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs

Cited by 45 publications (11 citation statements)
References 24 publications
“…One area uses uniform-precision quantization, where the model shares the same precision (Choukroun et al., 2019; Gong et al., 2019; Langroudi et al., 2019; Jin et al., 2020a; Bhalgat et al., 2020; Darvish Rouhani et al., 2020; Oh et al., 2021). Another direction studies mixed-precision quantization, which determines the bit-width of each layer through search algorithms, aiming at a better accuracy-efficiency trade-off (Dong et al., 2019; Wang et al., 2019; Habi et al., 2020; Fu et al., 2020; Yang & Jin, 2020; Zhao et al., 2021a;b; Ma et al., 2021b). There are also binarized networks, which apply only 1-bit precision (Rastegari et al., 2016; Hubara et al., 2016; Cai et al., 2017; Bulat et al., 2020; …).…”
Section: Related Work
confidence: 99%
“…Later, [2] suggests an end-to-end learning approach using a rate-distortion objective. To optimize performance under quantization, several works [16,21,52,54] use mixed-precision quantization, while others [9,18,28,32,39,40] propose post-quantization optimization techniques.…”
Section: Model Compression
confidence: 99%
“…Power-of-Two Thresholds. A uniform, symmetric quantizer (either signed or unsigned) with a power-of-two integer threshold is said to be a hardware-friendly quantizer [18]. Restricting the threshold of a symmetric quantizer to power-of-two integers (i.e.…”
Section: Background and Basic Notions
confidence: 99%
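
The last excerpt describes the hardware-friendly quantizer referenced from HMQ: a uniform, symmetric quantizer whose threshold is restricted to a power-of-two integer. Below is a minimal NumPy sketch of such a quantizer; the function name hw_friendly_quantize, the parameter names, and the simulated-quantization (scale, round, clip, rescale) formulation are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def hw_friendly_quantize(x, n_bits=8, threshold_exp=0, signed=True):
    # Uniform, symmetric quantizer with a power-of-two threshold t = 2**threshold_exp.
    # Illustrative sketch of the hardware-friendly quantizer described in the excerpt;
    # names and defaults are assumptions, not taken from the paper.
    t = 2.0 ** threshold_exp
    if signed:
        levels = 2 ** (n_bits - 1)      # e.g. 128 levels per side for 8-bit signed
        step = t / levels               # uniform quantization step
        q = np.clip(np.round(np.asarray(x, dtype=float) / step), -levels, levels - 1)
    else:
        levels = 2 ** n_bits            # e.g. 256 levels for 8-bit unsigned
        step = t / levels
        q = np.clip(np.round(np.asarray(x, dtype=float) / step), 0, levels - 1)
    return q * step                     # dequantized ("fake-quantized") value

# Example: 8-bit signed quantization with threshold 2**0 = 1 clips values outside [-1, 1).
print(hw_friendly_quantize([-1.3, -0.26, 0.0, 0.51, 0.99]))

Because the threshold is a power of two, the rescaling by the step size can be realized on integer hardware as a bit shift, which is what makes this restriction hardware friendly.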