2019
DOI: 10.48550/arxiv.1903.08066
Preprint

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Cited by 14 publications (22 citation statements). References 0 publications.
“…In the case of the scale s, several studies [1, 9, 19, 20] have used an empirical or specific value, such as a power of two. In this work, we assume the range represented by the original b-bit precision weights is equal to that of the up-scaled (b+1)-bit precision weights.…”
Section: Quantization Methods (mentioning)
confidence: 99%
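The range assumption in this excerpt can be checked with a minimal sketch, assuming a symmetric uniform quantizer with step size s and bit-width b; the helper name and the power-of-two scale below are illustrative choices, not values from the cited works. Halving the scale while adding one bit keeps the representable range essentially unchanged.

```python
# Minimal sketch: representable range of a symmetric uniform quantizer
# with step size s and bit-width b (helper name and values are illustrative).
def sym_range(s, b):
    return -s * 2 ** (b - 1), s * (2 ** (b - 1) - 1)

b, s = 8, 2.0 ** -7             # hypothetical power-of-two scale for b bits
print(sym_range(s, b))          # (-1.0, 0.9921875)
print(sym_range(s / 2, b + 1))  # (-1.0, 0.99609375): (b+1) bits at half the scale
                                # covers essentially the same range
```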
“…However, post-training quantization on these models results in an unacceptably sharp decline in accuracy [21], dropping from 90% or better to 1% or worse on the ImageNet dataset. Accuracy can be reclaimed using various methods, including re-training and quantization-aware training [53, 54], but this is not always possible or convenient if the required additional computation or expertise is high, or if the training data are unavailable due to legal or privacy issues. We posit that error-bounded lossy compression algorithms may be an alternative, accuracy-preserving method of compressing depth-wise separable models.…”
Section: Quantization Effectiveness on MobileNets (mentioning)
confidence: 99%
“…Many works address these issues using different methods. These include pruning [16, 45, 47], efficient neural architecture design [14, 21, 24, 38], hardware and CNN co-design [14, 20, 43], and quantization [6, 13, 15, 23, 24, 46].…”
Section: Introduction (mentioning)
confidence: 99%
“…For high compression rates, this is usually achieved by fine-tuning a pre-trained model for quantization. In addition, recent work on quantization has focused on making quantizers more hardware-friendly (amenable to deployment on embedded devices) by restricting quantization schemes to be per-tensor, uniform, symmetric, and with thresholds that are powers of two [24, 41].…”
Section: Introduction (mentioning)
confidence: 99%
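The hardware-friendly restrictions listed in this excerpt (per-tensor, uniform, symmetric, power-of-two thresholds) can be sketched as follows. This is a minimal illustration in Python; the function name, tensor shape, and threshold choice are assumptions made here, not code from the cited paper.

```python
import numpy as np

def fake_quantize_per_tensor(x, log2_t, bits=8):
    # Per-tensor, uniform, symmetric fake quantization with a clipping
    # threshold constrained to a power of two, t = 2**round(log2_t).
    # Illustrative only; not the reference implementation of the cited works.
    t = 2.0 ** np.round(log2_t)
    scale = t / 2 ** (bits - 1)                   # uniform step size
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), qmin, qmax)  # integer grid, symmetric about 0
    return q * scale                              # simulated fixed-point values in float

# Hypothetical convolution weight tensor, quantized as a single tensor.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q = fake_quantize_per_tensor(w, log2_t=np.log2(np.abs(w).max()), bits=8)
```

As the paper's title suggests, the quantization thresholds themselves (here, log2_t) would be treated as trainable parameters rather than fixed statistics.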