2020
DOI: 10.1007/s11263-020-01339-6

Hardware-Centric AutoML for Mixed-Precision Quantization

Abstract: Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off accuracy, latency, energy, and model size, which is both time-consuming and usually sub-optimal. There are plenty of specialized ha…
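
To give a concrete sense of the design space the abstract describes, the sketch below enumerates per-layer bitwidth assignments for a toy three-layer network and counts how many fit a model-size budget. The layer names, parameter counts, and budget are illustrative assumptions, not values from the paper, and the brute-force enumeration only stands in for the automated search the paper proposes.

# Minimal sketch (not the paper's implementation) of one slice of the
# mixed-precision design space: which per-layer bitwidth policies fit a
# model-size budget. Layer names and parameter counts are hypothetical.
import itertools

layer_params = {"conv1": 4_608, "conv2": 73_728, "fc": 512_000}
allowed_bits = range(1, 9)  # accelerators supporting 1-8 bit operands

def model_size_bits(bit_policy):
    """Model size (in bits) of a given per-layer bitwidth assignment."""
    return sum(layer_params[name] * bits for name, bits in bit_policy.items())

# Exhaustive search is only feasible for toy cases: the space grows as 8^L,
# which is why automated (learning-based) search is used in practice.
budget_bits = 4 * sum(layer_params.values())  # e.g. match a uniform 4-bit model
feasible = []
for bits in itertools.product(allowed_bits, repeat=len(layer_params)):
    policy = dict(zip(layer_params, bits))
    if model_size_bits(policy) <= budget_bits:
        feasible.append(policy)
print(f"{len(feasible)} of {8 ** len(layer_params)} policies meet the size budget")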

Cited by 11 publications (7 citation statements)
References 17 publications (21 reference statements)
“…ReLeQ [83] and HAQ [84] adopt reinforcement learning to learn the wordlength of each data-structure in a layerwise manner. Specifically, the ReLeQ uses the predicted bit-precision level to quantize the weights as in WRPN, whereas the HAQ quantizes both weights and activations on each explored wordlength in the same way as TensorRT.…”
Section: Mixed-Precision Quantization (mentioning; confidence: 99%)
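
The layerwise quantization this excerpt describes can be illustrated with a minimal sketch of symmetric linear quantization at a chosen wordlength, in the spirit of the TensorRT-style scheme the quote mentions. The per-tensor scaling, the 4-bit/6-bit choices, and the function name are assumptions for illustration, not code from HAQ or ReLeQ.

import numpy as np

def linear_quantize(x, bits, signed=True):
    """Symmetric linear (uniform) quantization of a tensor to `bits` bits,
    returning the dequantized ("fake-quantized") values."""
    qmax = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    scale = np.abs(x).max() / qmax + 1e-12          # per-tensor scale (simplified)
    q = np.clip(np.round(x / scale), -qmax - 1 if signed else 0, qmax)
    return q * scale

# One layer quantized at an explored wordlength: 4-bit weights, 6-bit activations.
rng = np.random.default_rng(0)
w, a = rng.normal(size=(16, 16)), rng.normal(size=(16,))
w_q = linear_quantize(w, bits=4, signed=True)
a_q = linear_quantize(np.maximum(a, 0), bits=6, signed=False)  # post-ReLU activations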
“…As a summary of the literature, improving the accuracy of quantized DNNs comes at the expense of floating-point computational cost in [30], [32], [34], [35], [38], [42], [45], [56]- [58], [61], [63]- [67], [69], [74], [76], [78]- [80], [82]- [84], [86]- [88]. Specifically, these approaches scale output activations of each layer with FP32 coefficient(s) to recover the dynamic range, and/or perform batch normalization as well as the operations of first and last layers with FP32 datastructures.…”
Section: Mixed-Precision Quantization (mentioning; confidence: 99%)
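
A minimal sketch of the floating-point overhead this excerpt refers to: a low-precision matrix product is accumulated in integers and then rescaled by an FP32 coefficient to recover the dynamic range, while batch normalization and the first/last layers would typically stay in FP32. The function name, the 8-bit setting, and the scale values are hypothetical.

import numpy as np

def int_matmul_with_fp32_rescale(a_q, w_q, a_scale, w_scale):
    """Integer GEMM followed by a floating-point rescale.
    The FP32 coefficient (a_scale * w_scale) restores the dynamic range of the
    integer accumulator, which is the floating-point cost the survey notes."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)    # low-precision compute
    return acc.astype(np.float32) * (a_scale * w_scale)  # FP32 scaling per layer

# Hypothetical 8-bit inputs/weights with per-tensor scales.
rng = np.random.default_rng(1)
a_q = rng.integers(-128, 128, size=(1, 64), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(64, 32), dtype=np.int8)
y = int_matmul_with_fp32_rescale(a_q, w_q, a_scale=0.02, w_scale=0.005)
# Batch norm (and typically the first/last layers) would still run in FP32 here.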
“…In order to tackle these, researchers have introduced different techniques to reduce the search cost, including differentiable architecture search [52], path-level binarization [53], single-path one-shot sampling [54], [55], [56], and weight sharing [50], [56], [57]. Furthermore, neural architecture search has also been used in compressing and accelerating neural networks, including pruning [35], [58], [59], [60], [61] and quantization [37], [54], [62], [63]. Most of these methods are tailored for 2D visual recognition, which has many well-defined search spaces [64].…”
Section: Neural Architecture Search (mentioning; confidence: 99%)
“…To tackle these, researchers have proposed different techniques to reduce the search cost, including differentiable architecture search [30], path-level binarization [6], single-path one-shot sampling [15,8,4], and weight sharing [50,4,57]. Besides, neural architecture search has also been used in compressing and accelerating neural networks, including pruning [17,31,5,27] and quantization [58,15,59,62]. Most of these methods are tailored for 2D visual recognition, which has many well-defined search spaces [44].…”
Section: Neural Architecture Search (mentioning; confidence: 99%)