2021
DOI: 10.48550/arxiv.2106.08295
Preprint

A White Paper on Neural Network Quantization

Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating…
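
As a rough sketch of what the abstract means by quantization noise (an illustration added here, not code from the paper), a uniform affine quantize-dequantize pass in NumPy shows how rounding and clipping perturb a tensor; the 8-bit width and min/max range estimation are assumptions.

    import numpy as np

    def quantize_dequantize(x, num_bits=8):
        """Uniform affine (asymmetric) fake quantization of a tensor.

        Maps x onto the integer grid [0, 2**num_bits - 1] and back, so the
        returned tensor carries the rounding/clipping noise that real
        low-bit inference would introduce.
        """
        qmin, qmax = 0, 2 ** num_bits - 1
        # Scale and zero-point derived from the observed min/max range.
        scale = (x.max() - x.min()) / (qmax - qmin)
        scale = max(scale, 1e-8)                      # avoid division by zero
        zero_point = int(round(qmin - x.min() / scale))
        x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
        return scale * (x_int - zero_point)           # dequantized approximation

    # Quantization error on a random weight tensor
    w = np.random.randn(64, 64).astype(np.float32)
    w_q = quantize_dequantize(w, num_bits=8)
    print("mean abs error:", np.abs(w - w_q).mean())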

Cited by 65 publications (112 citation statements) | References 12 publications

Citation statements (ordered by relevance):
“…When v is the activation output, the second term can be pre-computed and absorbed by the convolution bias, so it will not introduce extra calculation to inference. Therefore, it is recommended by Nagel et al (2021) to apply asymmetric quantization on activation and symmetric quantization on weights. Nagel et al (2021) also suggests to adopt a per-channel quantization on weights (Krishnamoorthi, 2018; Li et al, 2019).…”
Section: Preliminary (mentioning, confidence: 99%)
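
To make the "absorbed by the convolution bias" point concrete, the following is a small illustrative sketch (my own, with assumed shapes and helper names, and using a linear layer rather than a convolution): with symmetric per-channel weight quantization and asymmetric activation quantization, the activation zero-point contributes an input-independent term that can be precomputed and folded into the bias.

    import numpy as np

    def quantize_symmetric_per_channel(w, num_bits=8):
        """Symmetric per-output-channel weight quantization: w ~= scale[:, None] * w_int."""
        qmax = 2 ** (num_bits - 1) - 1
        scale = np.maximum(np.abs(w).max(axis=1) / qmax, 1e-8)   # one scale per output channel
        w_int = np.clip(np.round(w / scale[:, None]), -qmax - 1, qmax)
        return w_int, scale

    def quantize_asymmetric(x, num_bits=8):
        """Asymmetric activation quantization: x ~= s * (x_int - z)."""
        qmax = 2 ** num_bits - 1
        s = (x.max() - x.min()) / qmax
        z = int(round(-x.min() / s))
        x_int = np.clip(np.round(x / s) + z, 0, qmax)
        return x_int, s, z

    # y = W x + b with quantized operands:
    #   W x ~= s_w * s_x * (W_int @ x_int) - s_w * s_x * z_x * (W_int @ 1)
    # The second term does not depend on the input, so it is folded into the bias.
    rng = np.random.default_rng(0)
    W, b, x = rng.normal(size=(16, 32)), rng.normal(size=16), rng.normal(size=32)

    W_int, s_w = quantize_symmetric_per_channel(W)
    x_int, s_x, z_x = quantize_asymmetric(x)

    precomputed_offset = s_w * s_x * z_x * W_int.sum(axis=1)     # absorbed into the bias
    y_quant = s_w * s_x * (W_int @ x_int) - precomputed_offset + b
    print("max abs error vs float:", np.abs(W @ x + b - y_quant).max())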
“…Real-time inference on resource-constrained and efficiency-demanding platforms has long been desired and extensively studied in the last decades, resulting in significant improvement on the trade-off between efficiency and accuracy (Han et al, 2015; Mei et al, 2019; Tanaka et al, 2020; Ma et al, 2020; Mishra et al, 2020; Liang et al, 2021; Liu et al, 2021). As a model compression technique, quantization is promising compared to other methods, such as network pruning (Tanaka et al, 2020; Ma et al, 2020) and slimming (Liu et al, 2017; 2018), as it achieves a large compression ratio (Krishnamoorthi, 2018; Nagel et al, 2021) and is computationally beneficial for integer-only hardware. The latter one is especially important because many hardwares (e.g., most brands of DSPs (Ho, 2015; QCOM, 2019)) only support integer or fixed-point arithmetic for accelerated implementation and cannot deploy models with floating-point operations.…”
Section: Introduction (mentioning, confidence: 99%)
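
As a toy illustration of the two advantages named in this excerpt (not code from the cited works): with 8-bit integer weights and activations, the matrix multiply can be accumulated entirely in int32, which is what integer-only DSPs accelerate, and storage shrinks 4x relative to FP32. The scales and sizes below are made up for the example.

    import numpy as np

    # Toy int8 matmul with int32 accumulation: the inner loop needs no floats.
    # On real hardware even the final rescale can be replaced by a fixed-point
    # multiplier and shift.
    rng = np.random.default_rng(1)
    W_int = rng.integers(-128, 128, size=(16, 32), dtype=np.int8)
    x_int = rng.integers(0, 256, size=32, dtype=np.int16)   # uint8 values, widened
    z_x = 128                                                # activation zero-point

    acc = W_int.astype(np.int32) @ (x_int.astype(np.int32) - z_x)   # pure integer math

    s_w, s_x = 0.02, 0.05                     # example scales chosen for illustration
    y = (s_w * s_x) * acc                     # single rescale back to real values

    # Storage: int8 weights are 4x smaller than float32 weights.
    print("compression ratio:", np.float32(0).nbytes / np.int8(0).nbytes)   # -> 4.0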