Abstract: Neural network quantization has become an important research area due to its great impact on the deployment of large models on resource-constrained devices. In order to train networks that can be effectively discretized without loss of performance, we introduce a differentiable quantization procedure. Differentiability can be achieved by transforming continuous distributions over the weights and activations of the network to categorical distributions over the quantization grid. These are subsequently relaxed to co…
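The relaxation described in this abstract can be illustrated with a minimal PyTorch sketch, assuming a fixed uniform grid and a temperature-controlled softmax over distance-based logits; the function name and the Gaussian-style logits are illustrative choices, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def relaxed_quantize(w, grid, sigma=0.1, temperature=1.0):
    # Negative squared distance to every grid point acts as a logit,
    # so nearby grid points receive most of the probability mass.
    logits = -((w.unsqueeze(-1) - grid) ** 2) / (2 * sigma ** 2)
    probs = F.softmax(logits / temperature, dim=-1)
    # Soft (differentiable) assignment used during training ...
    w_soft = (probs * grid).sum(dim=-1)
    # ... and a hard snap to the nearest grid point for deployment.
    w_hard = grid[logits.argmax(dim=-1)]
    return w_soft, w_hard

# Example: a symmetric 2-bit grid
grid = torch.tensor([-1.5, -0.5, 0.5, 1.5])
w = torch.randn(8)
w_soft, w_hard = relaxed_quantize(w, grid)
```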
“…The proposed FAT is built on the PyTorch framework. We compare FAT with state-of-the-art approaches, including WAGE [40], LQ-Net [43], PACT [7], RQ [20], UNIQ [3], DQ [35], BCGD [2] [35], DSQ [10], QIL [13], HAQ [36], APoT [17], HMQ [11], DJPQ [38], LSQ [8].…”
Section: Methods
“…As shown in Table 2, we compare all methods that appeared in the main paper, including WAGE [40], LQ-Net [43], PACT [7], RQ [20], UNIQ [3], DQ [35], BCGD [2] [35], DSQ [10], QIL [13], HAQ [36], APoT [17], HMQ [11], DJPQ [38], LSQ [8].…”
Section: Categorization of Quantization Methods
Learning convolutional neural networks (CNNs) with low bitwidth is challenging because performance may drop significantly after quantization. Prior works often discretize the network weights by carefully tuning quantization hyper-parameters (e.g. non-uniform step size and layer-wise bitwidths), which is complicated and sub-optimal because of the large discrepancy between the full-precision and low-precision models. This work presents a novel quantization pipeline, Frequency-Aware Transformation (FAT), which has several appealing benefits. (1) Rather than designing complicated quantizers like existing works, FAT learns to transform network weights in the frequency domain before quantization, making them more amenable to training in low bitwidth. (2) With FAT, CNNs can be easily trained in low precision using simple standard quantizers without tedious hyper-parameter tuning. Theoretical analysis shows that FAT improves both uniform and non-uniform quantizers. (3) FAT can be easily plugged into many CNN architectures. When training ResNet-18 and MobileNet-V2 in 4 bits, FAT plus a simple rounding operation already achieves 70.5% and 69.2% top-1 accuracy on ImageNet without bells and whistles, outperforming the recent state of the art while reducing computation by 54.9× and 45.7× relative to the full-precision models. We hope FAT provides a novel perspective for model quantization. Code is available at https://github.com/ChaofanTao/FAT_Quantization.
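A rough PyTorch sketch of this idea is given below, assuming the frequency-domain transform is an FFT followed by a learnable element-wise scaling and that a symmetric uniform quantizer with a straight-through gradient is used afterwards; the class name and these specific choices are illustrative assumptions and do not reproduce the official FAT code linked above.

```python
import torch
import torch.nn as nn

class FreqTransformQuant(nn.Module):
    """Illustrative sketch only: transform weights in the frequency
    domain with a learnable mask, transform back, then quantize with
    a plain symmetric uniform quantizer."""

    def __init__(self, weight_shape, bits=4):
        super().__init__()
        self.mask = nn.Parameter(torch.ones(weight_shape))  # learnable frequency response
        self.bits = bits

    def forward(self, w):
        # Learned transformation of the weight spectrum.
        w_trans = torch.fft.ifftn(torch.fft.fftn(w) * self.mask).real
        # Simple symmetric uniform quantizer on the transformed weights.
        qmax = 2 ** (self.bits - 1) - 1
        scale = w_trans.abs().max() / qmax
        w_q = torch.clamp(torch.round(w_trans / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: quantized values in the forward
        # pass, gradients flow to the transformed weights and the mask.
        return w_trans + (w_q - w_trans).detach()
```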
“…In this section, we compare our GMPQ with the state-of-the-art fixed-precision models including APoT [25] and RQ [31] and mixed-precision networks including ALQ [38], HAWQ [9], EdMIPS [3], HAQ [50], BP-NAS [56], HMQ [13] and DQ [47] on ImageNet for image classification and on PASCAL VOC for object detection. We also provide the performance of full-precision models for reference.…”
Section: Comparison with State-of-the-Art Methods
In this paper, we propose a generalizable mixed-precision quantization (GMPQ) method for efficient inference. Conventional methods require the dataset used for bitwidth search to match the one used for model deployment in order to guarantee policy optimality, leading to heavy search cost on challenging large-scale datasets in realistic applications. In contrast, our GMPQ searches for a mixed-precision quantization policy that generalizes to large-scale datasets using only a small amount of data, so that the search cost is significantly reduced without performance degradation. Specifically, we observe that correctly locating network attribution is a general ability for accurate visual analysis across different data distributions. Therefore, rather than directly pursuing higher model accuracy under complexity constraints, we preserve attribution rank consistency between the quantized models and their full-precision counterparts via efficient capacity-aware attribution imitation for generalizable mixed-precision quantization strategy search. Extensive experiments show that our method obtains a competitive accuracy-complexity trade-off compared with state-of-the-art mixed-precision networks at significantly reduced search cost. The code is available at https://github.com/ZiweiWangTHU/GMPQ.git.
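The attribution rank consistency idea can be sketched as a simple pairwise penalty, assuming attribution maps (e.g. Grad-CAM style) have already been computed for the quantized and full-precision models; the function name, random pair sampling, and hinge form below are illustrative assumptions, not the paper's exact capacity-aware loss.

```python
import torch

def attribution_rank_loss(attr_q, attr_fp, num_pairs=512):
    # Flatten spatial attribution maps to (batch, locations).
    a_q = attr_q.flatten(1)
    a_fp = attr_fp.flatten(1)
    # Sample random location pairs and record the full-precision ordering.
    idx = torch.randint(0, a_q.size(1), (2, num_pairs), device=a_q.device)
    target = torch.sign(a_fp[:, idx[0]] - a_fp[:, idx[1]])
    diff_q = a_q[:, idx[0]] - a_q[:, idx[1]]
    # Penalize the quantized model whenever its ordering flips
    # relative to the full-precision model.
    return torch.relu(-target * diff_q).mean()
```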
“…For both of the MNIST models, we found that letting each subcomponent of F be a simple dimension-wise scalar affine transform (similar to f_dense in Figure 3) was sufficient. Since each φ is quantized to integers, having a flexible scale and shift leads to flexible SQ, similar to (Louizos, Reisser, et al., 2018). Due to the small size of the networks, more complex transformation functions lead to too much overhead.…”
Section: MNIST Experiments
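As a concrete illustration of the dimension-wise affine decoder quoted above, a minimal PyTorch sketch follows; the class and parameter names are hypothetical, and the straight-through rounding of φ is an assumption about how the integer constraint is handled during training.

```python
import torch
import torch.nn as nn

class AffineReparam(nn.Module):
    """Sketch only: integer-valued latents phi are decoded into weights
    by a learnable per-dimension scale and shift."""

    def __init__(self, num_dims):
        super().__init__()
        self.phi = nn.Parameter(torch.zeros(num_dims))    # latent representation
        self.scale = nn.Parameter(torch.ones(num_dims))   # per-dimension scale
        self.shift = nn.Parameter(torch.zeros(num_dims))  # per-dimension shift

    def forward(self):
        # Round phi to integers with a straight-through gradient.
        phi_int = self.phi + (torch.round(self.phi) - self.phi).detach()
        return self.scale * phi_int + self.shift
```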
“…While these technically have a finite (but large) number of states, the best results in terms of both accuracy and bit rate are typically achieved for a significantly reduced number of states. Existing approaches to model compression often acknowledge this by quantizing each individual linear filter coefficient in an ANN to a small number of pre-determined values (Louizos, Reisser, et al., 2018; Baskin et al., 2018; F. Li et al., 2016).…”
We describe an end-to-end neural network weight compression approach that draws inspiration from recent latent-variable data compression methods. The network parameters (weights and biases) are represented in a "latent" space, amounting to a reparameterization. This space is equipped with a learned probability model, which is used to impose an entropy penalty on the parameter representation during training, and to compress the representation using arithmetic coding after training. We are thus maximizing accuracy and model compressibility jointly, in an end-to-end fashion, with the rate-error trade-off specified by a hyperparameter. We evaluate our method by compressing six distinct model architectures on the MNIST, CIFAR-10 and ImageNet classification benchmarks. Our method achieves state-of-the-art compression on VGG-16, LeNet-300-100 and several ResNet architectures, and is competitive on LeNet-5.
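The rate-error trade-off described in this abstract amounts to augmenting the task loss with an entropy penalty on the latent representation. The sketch below assumes a callable `prob_model` that returns per-latent probabilities, which is an illustrative stand-in for the learned probability model, not the authors' implementation.

```python
import torch

def penalized_objective(task_loss, latents, prob_model, lam=1e-3):
    # Expected number of bits needed to encode the latents with
    # arithmetic coding under the learned probability model.
    rate_bits = -torch.log2(prob_model(latents)).sum()
    # Rate-error trade-off controlled by the hyperparameter lam.
    return task_loss + lam * rate_bits
```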