2021
DOI: 10.48550/arxiv.2103.14949
Preprint

Automated Backend-Aware Post-Training Quantization

Ziheng Jiang, Animesh Jain, Andrew Liu, et al.

Abstract: Quantization is a key technique to reduce the resource requirement and improve the performance of neural network deployment. However, different hardware backends such as x86 CPU, NVIDIA GPU, ARM CPU, and accelerators may demand different implementations for quantized networks. This diversity calls for specialized post-training quantization pipelines to be built for each hardware target, an engineering effort that is often too large for developers to keep up with. We tackle this problem with an automated post-trai…
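As a rough illustration of what post-training quantization does (a generic sketch, not the backend-aware pipeline the paper proposes), the snippet below quantizes a float32 weight tensor to int8 with a per-tensor symmetric scale. The tensor contents and the choice of symmetric per-tensor scaling are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    """Per-tensor symmetric int8 quantization: w ~= scale * q."""
    # Scale maps the largest absolute value onto the int8 range [-127, 127].
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: quantize a random "weight" tensor and measure the error introduced.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.abs(w - w_hat).max())
```

The hard part the paper targets is not this arithmetic but generating efficient int8 kernels for each backend (x86, NVIDIA GPU, ARM, accelerators) automatically.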

Cited by 2 publications (5 citation statements)
References 8 publications
“…This is because the CodeGen in the Glow is not efficiently implemented at the int8 quantization level for chosen target platforms. Previous studies have reported that naively implemented kernels for quantization can be slower than the original models [10,22,44,45]. As mentioned in Section 3, the four schemes are a trade-off between inference latency and quantization error.…”
Section: Latency (mentioning, confidence: 99%)
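For context on why naively implemented quantized kernels can lose to the FP32 originals: a common failure mode is dequantizing operands back to float for every operation instead of accumulating in integer arithmetic. The sketch below contrasts the two styles in NumPy; it is an illustrative assumption, not Glow's code generation or the kernels these papers benchmarked.

```python
import numpy as np

def naive_quantized_matmul(qa, sa, qb, sb):
    # Dequantize both int8 operands to float32 and multiply: numerically fine,
    # but the extra conversion traffic can make it slower than the FP32 baseline.
    return (qa.astype(np.float32) * sa) @ (qb.astype(np.float32) * sb)

def int_accum_quantized_matmul(qa, sa, qb, sb):
    # Accumulate the int8 product in int32 and apply both scales once at the
    # end; this is the pattern fast int8 backends are built around.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)
```

Both functions compute the same result up to rounding; the difference is where the work happens, which is exactly what backend-aware code generation has to get right per target.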
“…In a nutshell, the quantization method in DL frameworks preserves the accuracy of the quantized models; however, it is insufficient to provide latency improvement on diverse hardware devices. To overcome this challenge, few studies have been proposed [10,22,44,45]. An efficient kernel code generation for quantized models is an important research direction that is beyond the scope of our study.…”
Section: Latency (mentioning, confidence: 99%)
“…PTQ methods, which convert high-precision representation bits to low-precision bits without requiring retraining steps, have been extensively studied by researchers and widely adopted in practical scenarios [21,18,1,4,33,45,22,10,30,31,42]. PTQ helps in the rapid deployment of CNN models on resource-constrained devices by addressing too time-consuming and data privacy issues associated with retraining.…”
Section: Model Quantization (mentioning, confidence: 99%)
“…Despite QAT's accuracy preservation benefits over PTQ, its adoption has been limited due to privacy concerns, resource-intensive and time-consuming retraining processes, and the need for expertise in developing model architectures for hyper-parameter tuning [21,9,2,44,19,46,16,12]. In practice, PTQ methods have been more commonly employed due to their applicability [21,18,1,4,45,22,10,31,42]. PTQ allows pre-trained models to be calibrated without the need for retraining, using only a small unlabeled dataset.…”
Section: Introduction (mentioning, confidence: 99%)
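To make the calibration step concrete: one common PTQ recipe (an assumption here, not necessarily the exact procedure used in the citing papers) runs a small set of unlabeled batches through the model, tracks per-tensor activation ranges, and derives an affine scale and zero-point from them. A minimal min-max observer sketch, with random batches standing in for real activations:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running range of a tensor and derives uint8 affine parameters."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, x: np.ndarray):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def qparams(self):
        # Affine mapping of [lo, hi] onto the uint8 range [0, 255].
        scale = max(self.hi - self.lo, 1e-8) / 255.0
        zero_point = int(np.clip(round(-self.lo / scale), 0, 255))
        return scale, zero_point

# Calibration loop over a small unlabeled dataset (placeholder batches).
observer = MinMaxObserver()
for _ in range(8):
    activations = np.random.randn(32, 128).astype(np.float32)  # stand-in for real model outputs
    observer.observe(activations)
print(observer.qparams())
```

No labels or gradient updates are needed, which is why PTQ avoids the retraining cost and data-privacy concerns raised above.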