A Survey of Quantization Methods for Efficient Neural Network Inference

Gholami, Amir; Kim, Sehoon; Dong, Zhen; Yao, Zhewei; Mahoney, Michael W.; Keutzer, Kurt

doi:10.1201/9781003162810-13

Cited by 460 publications

(169 citation statements)

References 226 publications

(324 reference statements)


“…On the other hand, if f[0] is negative, the sign extension part are all 1s in binary expression and represents -1 in 2's complementary representation. In such condition, we decrement 1 from f[1] to form the second S-bit and perform the packing process with concatenation and 1-bit incrementer instead of using a larger bitwidth adder. The packing process works recursively for all the slices while slicing of the output works in a reversed manner.…”

Section: A From Multiplication To Convolution (mentioning)

confidence: 99%


HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation

Liu, Chen, Ganesh, et al. 2021

Preprint


Quantization for Convolutional Neural Network (CNN) has shown significant progress with the intention of reducing the cost of computation and storage with low-bitwidth data inputs. There are, however, no systematic studies on how an existing full-bitwidth processing unit, such as CPUs and DSPs, can be better utilized to carry out significantly higher computation throughput for convolution under various quantized bitwidths. In this study, we propose HiKonv, a unified solution that maximizes the compute throughput of a given underlying processing unit to process low-bitwidth quantized data inputs through novel bitwise parallel computation. We establish theoretical performance bounds using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution, and demonstrate new breakthroughs for high-performance computing in this critical domain. For example, a single 32-bit processing unit can deliver 128 binarized convolution operations (multiplications and additions) under one CPU instruction, and a single 27×18 DSP core can deliver eight convolution operations with 4-bit inputs in one cycle. We demonstrate the effectiveness of HiKonv on CPU and FPGA for both convolutional layers or a complete DNN model. For a convolutional layer quantized to 4-bit, HiKonv achieves a 3.17× latency improvement over the baseline implementation using C++ on CPU. Compared to the DAC-SDC 2020 champion model for FPGA, HiKonv achieves a 2.37× throughput improvement and 2.61× DSP efficiency improvement, respectively.
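The abstract above relies on the idea that one wide multiplication can carry out several low-bitwidth multiply-accumulate operations at once. A minimal sketch of that underlying arithmetic follows, assuming unsigned inputs only: values are packed into fixed-width slots of one integer, the two packed integers are multiplied once, and the slots of the product hold the 1-D convolution outputs. This is not HiKonv's implementation; the function names (pack, unpack, conv1d_via_multiply), the slot-width rule, and the use of Python's unbounded integers in place of a 32-bit unit or DSP core are all assumptions made for illustration, and the signed-value handling described in the paper is omitted.

```python
def pack(values, slot_bits):
    """Place each unsigned value in its own slot_bits-wide slot of one integer."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * slot_bits)
    return word


def unpack(word, slot_bits, count):
    """Read back `count` unsigned slots of slot_bits each, lowest slot first."""
    mask = (1 << slot_bits) - 1
    return [(word >> (i * slot_bits)) & mask for i in range(count)]


def conv1d_via_multiply(f, g, value_bits):
    """Full 1-D convolution of two unsigned sequences using one multiplication.

    Each output is a sum of at most min(len(f), len(g)) products, and each
    product is below 2**(2 * value_bits), so the slot width chosen here is
    wide enough that no output spills into the neighbouring slot.
    """
    terms = min(len(f), len(g))
    slot_bits = 2 * value_bits + (terms - 1).bit_length()
    product = pack(f, slot_bits) * pack(g, slot_bits)
    return unpack(product, slot_bits, len(f) + len(g) - 1)


def conv1d_reference(f, g):
    """Schoolbook convolution, used only to check the packed version."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out
```

For example, conv1d_via_multiply([3, 1, 2], [1, 2], value_bits=4) gives [3, 7, 4, 4], matching conv1d_reference. The guard bits added to each slot are what a fixed-width multiplier has to budget for, which is why the number of operations per instruction depends on the input bitwidth.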



“…Quantization is a frequently used technique in hardware implementation of Deep Neural Network (DNN) models in order to reduce both the memory consumption and execution time [1]-[6]. It is typically done by approximating high-precision floating point numbers to low-bitwidth integers or fixed-point numbers.…”

Section: Introduction (mentioning)

confidence: 99%

HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation

Liu, Chen, Ganesh, et al. 2021

Preprint


“…†Work done when the author was at Microsoft. Figure 1: Growth of DNN model size and GPU memory capacity over the past decade [14,57]. Memory consumed here only accounts for model state which is a small fraction of total training memory footprint [7,14,31,63,68,74].…”

Section: Introduction (mentioning)

confidence: 99%

Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers

Li, Phanishayee, Murray, et al. 2022

Preprint


Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments. Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6× over highly optimized baselines with virtualized memory.


“…It is required to compress these neural networks. Quantization is one of the most effective ways to compress neural networks [8]. The floating-point values are quantized to integers with a low bit-width, reducing the memory consumption and the computation cost.…”

Section: Introduction (mentioning)

confidence: 99%

PTQ4ViT: Post-Training Quantization Framework for Vision Transformers

Yuan, Chen, Wu, et al. 2021

Preprint


Quantization is one of the most effective methods to compress neural networks, which has achieved great success on convolutional neural networks (CNNs). Recently, vision transformers have demonstrated great potential in computer vision. However, previous post-training quantization methods performed not well on vision transformer, resulting in more than 1% accuracy drop even in 8-bit quantization. Therefore, we analyze the problems of quantization on vision transformers. We observe the distributions of activation values after softmax and GELU functions are quite different from the Gaussian distribution. We also observe that common quantization metrics, such as MSE and cosine distance, are inaccurate to determine the optimal scaling factor. In this paper, we propose the twin uniform quantization method to reduce the quantization error on these activation values. And we propose to use a Hessian guided metric to evaluate different scaling factors, which improves the accuracy of calibration with a small cost. To enable the fast quantization of vision transformers, we develop an efficient framework, PTQ4ViT. Experiments show the quantized vision transformers achieve near-lossless prediction accuracy (less than 0.5% drop at 8-bit quantization) on the ImageNet classification task.
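The abstract above is about choosing scaling factors for uniform quantization. As a point of reference, a minimal sketch of plain symmetric uniform quantization is given here: a float is divided by a scale, rounded to the nearest integer, and clipped to the signed range of the chosen bit-width. This is not PTQ4ViT's twin uniform quantization or its Hessian-guided metric; the function names and the max-abs calibration rule are assumptions chosen only to illustrate what a "scaling factor" does.

```python
def max_abs_scale(values, bits=8):
    """Hypothetical calibration rule: map the largest magnitude to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, bits=8):
    """Round value / scale to the nearest integer and clip to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integer levels back to approximate real values."""
    return [q * scale for q in quantized]
```

With scale = 0.25 and bits = 4, the inputs [-2.0, -0.3, 0.0, 0.9, 5.0] quantize to [-8, -1, 0, 4, 7] and dequantize to [-2.0, -0.25, 0.0, 1.0, 1.75]. The last input is clipped, and every unclipped input is reconstructed to within half a scale step. A smaller scale reduces rounding error but clips more values, which is the trade-off a calibration metric has to resolve.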

