2018
DOI: 10.48550/arxiv.1810.05723
Preprint

Post-training 4-bit quantization of convolution networks for rapid-deployment

Ron Banner, Yury Nahshan, Elad Hoffer, et al.

Abstract: Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full datasets and time-consuming fine-tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning)…
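As a point of reference (not the paper's specific method, which adds analytical clipping, per-channel bit allocation, and bias correction), a minimal NumPy sketch of the generic operation being refined, uniform symmetric per-channel 4-bit quantization of a pretrained weight tensor with no retraining, might look as follows; the function name and tensor layout are illustrative assumptions.

    import numpy as np

    def quantize_per_channel(w, num_bits=4):
        """Uniform symmetric per-channel fake-quantization of a conv weight tensor.

        Illustrative sketch only: each output channel is mapped onto signed
        num_bits integer levels and immediately dequantized, so the returned
        tensor has the same shape as the input.
        """
        qmax = 2 ** (num_bits - 1) - 1                      # 7 for 4-bit signed values
        # One scale per output channel (axis 0), from that channel's max magnitude.
        scale = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax
        scale = scale.reshape(-1, *([1] * (w.ndim - 1)))
        scale = np.where(scale == 0, 1.0, scale)            # guard against all-zero channels
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)   # integer grid [-8, 7]
        return q * scale                                    # dequantized weights

    # Example: quantize a random conv kernel laid out as (out_ch, in_ch, kH, kW).
    w = np.random.randn(64, 32, 3, 3).astype(np.float32)
    w_q = quantize_per_channel(w, num_bits=4)
    print("max abs quantization error:", np.abs(w - w_q).max())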

Cited by 36 publications (69 citation statements)
References 13 publications
“…Hence, much attention has recently been dedicated to post-training quantization schemes, which directly quantize pretrained DNNs, with real-valued weights, without retraining. These quantization methods either rely on a small amount of data [1,3,35,24,14,31,19,22] or can be implemented without accessing training data, i.e. data-free compression [23,2,33,20].…”
Section: Introduction (mentioning)
confidence: 99%
“…Let S ⊆ ℝ^n be a Borel set. Unif(S) denotes the uniform distribution over S. An L-layer multi-layer perceptron, Φ, acts on a vector x ∈ ℝ^(N_0) via Φ(x) := φ^(L) ∘ A^(L) ∘ ⋯ ∘ φ^(1) ∘ A^(1)(x).…”
Section: Introduction (mentioning)
confidence: 99%
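The composition in the excerpt above is easy to make concrete; a minimal NumPy sketch, under the assumption that every φ^(l) is a ReLU (the excerpt does not fix the nonlinearity), is:

    import numpy as np

    def mlp(x, weights, biases, act=lambda z: np.maximum(z, 0.0)):
        """Evaluate Phi(x) = phi^(L) o A^(L) o ... o phi^(1) o A^(1)(x).

        Each A^(l) is the affine map x -> W_l @ x + b_l; following the quoted
        definition, the nonlinearity phi^(l) (assumed ReLU here) is applied
        after every affine layer, including the last.
        """
        for W, b in zip(weights, biases):
            x = act(W @ x + b)
        return x

    # Example: an L=3 layer perceptron with widths N_0=8, N_1=16, N_2=16, N_3=4.
    widths = [8, 16, 16, 4]
    weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(widths, widths[1:])]
    biases = [np.zeros(n_out) for n_out in widths[1:]]
    print(mlp(np.random.randn(widths[0]), weights, biases).shape)   # -> (4,)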
“…Since weight sharing does not modify the structure nor the precision of the network, it can be combined with other compression techniques like pruning and quantization to further improve compression ratio and runtime. Furthermore, recent works (Choi et al, 2018;Banner et al, 2018) have shown that sub-byte quantization of weights and/or activations can achieve inference accuracies comparable to full-precision networks.…”
Section: Introduction (mentioning)
confidence: 99%