2018
DOI: 10.48550/arxiv.1810.05723
Preprint

Post-training 4-bit quantization of convolution networks for rapid-deployment

Ron Banner, Yury Nahshan, Elad Hoffer, et al.

Abstract: Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full datasets and time-consuming fine-tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning)…
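As a point of reference (not the paper's specific method, which adds analytical clipping, per-channel bit allocation, and bias correction), a minimal NumPy sketch of the generic operation being refined, uniform symmetric per-channel 4-bit quantization of a pretrained weight tensor with no retraining, might look as follows; the function name and tensor layout are illustrative assumptions.

    import numpy as np

    def quantize_per_channel(w, num_bits=4):
        """Uniform symmetric per-channel fake-quantization of a conv weight tensor.

        Illustrative sketch only: each output channel is mapped onto signed
        num_bits integer levels and immediately dequantized, so the returned
        tensor has the same shape as the input.
        """
        qmax = 2 ** (num_bits - 1) - 1                      # 7 for 4-bit signed values
        # One scale per output channel (axis 0), from that channel's max magnitude.
        scale = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax
        scale = scale.reshape(-1, *([1] * (w.ndim - 1)))
        scale = np.where(scale == 0, 1.0, scale)            # guard against all-zero channels
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)   # integer grid [-8, 7]
        return q * scale                                    # dequantized weights

    # Example: quantize a random conv kernel laid out as (out_ch, in_ch, kH, kW).
    w = np.random.randn(64, 32, 3, 3).astype(np.float32)
    w_q = quantize_per_channel(w, num_bits=4)
    print("max abs quantization error:", np.abs(w - w_q).max())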

Cited by 36 publications (69 citation statements)
References 13 publications
“…Hence, much attention has recently been dedicated to post-training quantization schemes, which directly quantize pretrained DNNs, with real-valued weights, without retraining. These quantization methods either rely on a small amount of data [1,3,35,24,14,31,19,22] or can be implemented without accessing training data, i.e. data-free compression [23,2,33,20].…”
Section: Introduction (mentioning)
confidence: 99%
“…Let S ⊆ ℝ^n be a Borel set. Unif(S) denotes the uniform distribution over S. An L-layer multi-layer perceptron, Φ, acts on a vector x ∈ ℝ^(N_0) via Φ(x) := φ^(L) ∘ A^(L) ∘ ⋯ ∘ φ^(1) ∘ A^(1)(x).…”
Section: Introduction (mentioning)
confidence: 99%
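The composition in the excerpt above is easy to make concrete; a minimal NumPy sketch, under the assumption that every φ^(l) is a ReLU (the excerpt does not fix the nonlinearity), is:

    import numpy as np

    def mlp(x, weights, biases, act=lambda z: np.maximum(z, 0.0)):
        """Evaluate Phi(x) = phi^(L) o A^(L) o ... o phi^(1) o A^(1)(x).

        Each A^(l) is the affine map x -> W_l @ x + b_l; following the quoted
        definition, the nonlinearity phi^(l) (assumed ReLU here) is applied
        after every affine layer, including the last.
        """
        for W, b in zip(weights, biases):
            x = act(W @ x + b)
        return x

    # Example: an L=3 layer perceptron with widths N_0=8, N_1=16, N_2=16, N_3=4.
    widths = [8, 16, 16, 4]
    weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(widths, widths[1:])]
    biases = [np.zeros(n_out) for n_out in widths[1:]]
    print(mlp(np.random.randn(widths[0]), weights, biases).shape)   # -> (4,)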
“…Since weight sharing does not modify the structure nor the precision of the network, it can be combined with other compression techniques like pruning and quantization to further improve compression ratio and runtime. Furthermore, recent works (Choi et al, 2018;Banner et al, 2018) have shown that sub-byte quantization of weights and/or activations can achieve inference accuracies comparable to full-precision networks.…”
Section: Introduction (mentioning)
confidence: 99%