2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00801
Adaptive Loss-Aware Quantization for Multi-Bit Networks

Abstract: We investigate the compression of deep neural networks by quantizing their weights and activations into multiple binary bases, known as multi-bit networks (MBNs), which accelerate the inference and reduce the storage for the deployment on low-resource mobile and embedded platforms. We propose Adaptive Loss-aware Quantization (ALQ), a new MBN quantization pipeline that is able to achieve an average bitwidth below one-bit without notable loss in inference accuracy. Unlike previous MBN quantization solutions that…
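To make the abstract's notion of "multiple binary bases" concrete, below is a minimal Python sketch (NumPy only) of a greedy multi-bit weight quantizer: a weight vector w is approximated as a sum of scaled {-1, +1} bases, w ≈ sum_i alpha_i * b_i. This is a common baseline fit and not the paper's ALQ pipeline; ALQ additionally adapts the bases and per-group bitwidths with respect to the training loss. The function names (multibit_quantize, multibit_dequantize) are illustrative only.

import numpy as np

def multibit_quantize(w, num_bases=2):
    """Greedily fit `num_bases` scaled binary bases to the weight vector w."""
    residual = w.astype(np.float64).copy()
    alphas, bases = [], []
    for _ in range(num_bases):
        b = np.where(residual >= 0, 1.0, -1.0)   # binary basis in {-1, +1}
        alpha = np.abs(residual).mean()          # least-squares optimal scale for b = sign(residual)
        alphas.append(alpha)
        bases.append(b)
        residual -= alpha * b                    # fit the next basis to what is left over
    return np.array(alphas), np.stack(bases)

def multibit_dequantize(alphas, bases):
    """Reconstruct the approximate weights from scales and binary bases."""
    return (alphas[:, None] * bases).sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1024)
    alphas, bases = multibit_quantize(w, num_bases=2)
    w_hat = multibit_dequantize(alphas, bases)
    print("reconstruction MSE:", np.mean((w - w_hat) ** 2))

With two bases each weight costs two bits on average; the abstract's claim of an average bitwidth below one bit comes from ALQ adaptively assigning fewer (or no) bases to weight groups where the loss permits, which this sketch does not attempt.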

Cited by 38 publications (23 citation statements). References 13 publications.
“…In this section, we compare our GMPQ with the state-of-the-art fixed-precision models containing APoT [25] and RQ [31] and mixed-precision networks including ALQ [38], HAWQ [9], EdMIPS [3], HAQ [50], BP-NAS [56], HMQ [13] and DQ [47] on ImageNet for image classification and on PASCAL VOC for object detection. We also provide the performance of full-precision models for reference.…”
Section: Comparison With State-of-the-art Methods (mentioning, confidence: 99%)
“…Compression: Quantization reduces the bit-width of NN parameters, which permits a drastic reduction of the memory footprint [24,33,40]. It has become a standard compression technique in TinyML due to its significant memory savings while usually having a negligible effect on accuracy [11].…”
Section: Related Work (mentioning, confidence: 99%)
“…Whereas quantization can in principle be used with any bit-width, e.g. 4 bit [7] or an adaptive bitwidth [40], we focus on 8 bit quantization which is supported by most MCUs. Unsupported bit-widths need to be emulated, resulting in inefficient hardware utilization [3,5].…”
Section: Related Work (mentioning, confidence: 99%)
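For context on the 8-bit quantization the excerpt above focuses on, here is a minimal sketch of per-tensor int8 affine quantization in Python. The scale and zero-point formulas follow the common int8 scheme used by MCU-oriented inference runtimes; the function names are illustrative and not any specific library's API.

import numpy as np

def quantize_int8(x):
    """Map float values to int8 with a per-tensor scale and zero point."""
    x_min, x_max = float(x.min()), float(x.max())
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # make sure 0.0 is exactly representable
    scale = (x_max - x_min) / 255.0 or 1.0            # guard against a constant tensor
    zero_point = int(round(-128 - x_min / scale))     # x_min maps to -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

Fixing the representation to int8 keeps the arithmetic on natively supported integer units, which is exactly the hardware-utilization argument the citing work makes against emulating unsupported bit-widths.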
“…proposed, based on different compression methods such as knowledge distillation [4,31], pruning [15,19,20,40], quantization [32], neural architecture search (NAS) [38], etc. Among these categories, network pruning, which removes redundant and unimportant connections, is one of the most popular and promising compression methods, and recently received great interest from the industry that seeks to compress their AI models and fit them on small target devices with resource constraints.…”
Section: Introduction (mentioning, confidence: 99%)