2020
DOI: 10.48550/arxiv.2007.09952
Preprint

HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs

Abstract: Recent work in network quantization produced state-of-the-art results using mixed precision quantization. An imperative requirement for many efficient edge device hardware implementations is that their quantizers are uniform and with power-of-two thresholds. In this work, we introduce the Hardware Friendly Mixed Precision Quantization Block (HMQ) in order to meet this requirement. The HMQ is a mixed precision quantization block that repurposes the Gumbel-Softmax estimator into a smooth estimator of a pair of q…
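As a reading aid for the abstract, the following is a minimal Python sketch, assuming a symmetric uniform quantizer with a power-of-two threshold and a Gumbel-Softmax soft selection over candidate bit-widths. It is not the authors' HMQ implementation: the candidate bit-widths, the fixed threshold exponent, and all class and function names are illustrative assumptions, and in practice a straight-through estimator would still be needed for gradients to reach the weights through the rounding.

import torch
import torch.nn.functional as F

def uniform_quantize(x, threshold, bits):
    # Symmetric uniform quantizer clipped to [-threshold, threshold].
    levels = 2 ** (bits - 1) - 1          # e.g. 127 levels per side for 8 bits
    step = threshold / levels
    x = torch.clamp(x, -threshold, threshold)
    return torch.round(x / step) * step

class SoftBitwidthSelector(torch.nn.Module):
    # Differentiable mixture over candidate bit-widths via Gumbel-Softmax.
    # The threshold is restricted to a power of two, mirroring the hardware
    # constraint stated in the abstract; candidate bit-widths are assumptions.
    def __init__(self, candidate_bits=(2, 4, 8), threshold_exp=0):
        super().__init__()
        self.candidate_bits = candidate_bits
        self.logits = torch.nn.Parameter(torch.zeros(len(candidate_bits)))
        self.threshold = 2.0 ** threshold_exp     # power-of-two threshold

    def forward(self, w, tau=1.0):
        probs = F.gumbel_softmax(self.logits, tau=tau, hard=False)
        candidates = torch.stack(
            [uniform_quantize(w, self.threshold, b) for b in self.candidate_bits])
        # Weighted sum of the candidate quantizations; gradients flow to the logits.
        return sum(p * q for p, q in zip(probs, candidates))

During training, the temperature tau would typically be annealed so that the soft mixture collapses to a single bit-width choice.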

Cited by 5 publications (14 citation statements) | References 34 publications
“…In [37], the weights and activations are quantized separately in a two-step strategy. Mixed-precision is widely employed to achieve smaller quantization errors, such as LQ-Net [43], DJPQ [38] and HMQ [11]. In HAQ [36], the training policy is learned by reinforcement learning.…”
Section: Quantization Methods (mentioning; confidence: 99%)
“…Previous approaches [43,17,11,38,8] generally model the task of quantization as an error minimization problem, viz. min ‖W − Q(W)‖, where W is the weight and Q(·) the quantizer.…”
Section: Introduction (mentioning; confidence: 99%)
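To make the quoted error-minimization view concrete, here is a small NumPy sketch that, for a fixed bit-width, searches power-of-two thresholds and keeps the one minimizing ‖W − Q(W)‖. The quantizer form, the exponent range, and the names are assumptions for illustration, not code from any of the cited papers.

import numpy as np

def quantize(w, threshold, bits):
    # Symmetric uniform quantizer clipped to [-threshold, threshold].
    levels = 2 ** (bits - 1) - 1
    step = threshold / levels
    return np.clip(np.round(w / step), -levels, levels) * step

def best_pow2_threshold(w, bits, exponents=range(-4, 5)):
    # Return the power-of-two threshold minimizing the L2 quantization error.
    candidates = [2.0 ** e for e in exponents]
    errors = [np.linalg.norm(w - quantize(w, t, bits)) for t in candidates]
    return candidates[int(np.argmin(errors))]

w = 0.5 * np.random.randn(1024)
print(best_pow2_threshold(w, bits=4))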
“…In this section, we compare our GMPQ with the state-of-the-art fixed-precision models containing APoT [25] and RQ [31] and mixed-precision networks including ALQ [38], HAWQ [9], EdMIPS [3], HAQ [50], BP-NAS [56], HMQ [13] and DQ [47] on ImageNet for image classification and on PASCAL VOC for object detection. We also provide the performance of full-precision models for reference.…”
Section: Comparison With State-of-the-art Methods (mentioning; confidence: 99%)
“…Yang et al [55] decoupled the constrained optimization via Alternating Direction Method of Multipliers (ADMM), and Wang et al [53] utilized the variational information bottleneck to search for the proper bitwidth and pruning ratio. Habi et al [13] and Van et al [48] directly optimized the quantization intervals for bitwidth selection of mixed-precision networks. However, differentiable search for mixed-precision quantization still needs a large amount of time due to the optimization of the large hypernet.…”
Section: Mixed-precision Quantization (mentioning; confidence: 99%)
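For the statement that quantization intervals can be optimized directly, a generic sketch of the idea follows: the clipping threshold is a learnable parameter and a straight-through estimator lets gradients pass through the rounding. This is an assumed, simplified illustration, not the specific method of Habi et al [13] or Van et al [48].

import torch

class LearnedIntervalQuantizer(torch.nn.Module):
    # Learn the clipping threshold (quantization interval) of a symmetric
    # uniform quantizer by gradient descent; names and defaults are illustrative.
    def __init__(self, bits=4, init_threshold=1.0):
        super().__init__()
        self.bits = bits
        self.log_threshold = torch.nn.Parameter(torch.tensor(init_threshold).log())

    def forward(self, w):
        t = self.log_threshold.exp()              # keep the threshold positive
        levels = 2 ** (self.bits - 1) - 1
        step = t / levels
        w_clipped = torch.max(torch.min(w, t), -t)
        w_q = torch.round(w_clipped / step) * step
        # Straight-through estimator: forward uses w_q, backward sees identity.
        return w_clipped + (w_q - w_clipped).detach()

In a mixed-precision setting, one such interval (together with a bit-width choice) would be learned per layer or per tensor.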
“…3, we observe that depthwise convolution has larger bitwidth than the regular convolution. As found in (Jain et al, 2019), the depthwise convolution with irregular weight distributions is the main reason that makes quantization … to be superior to their fixed bitwidth counterparts (Wang et al, 2019; Uhlich et al, 2020; Cai & Vasconcelos, 2020; Habi et al, 2020). DDQ is naturally used to perform mixed-precision training by a binary block-diagonal matrix U.…”
Section: Evaluation On ImageNet (mentioning; confidence: 99%)