Recently, the posit numerical format has shown promise for DNN data representation and compute at ultra-low precision ([5..8]-bit). However, the majority of studies focus only on DNN inference. In this work, we propose DNN training using posits and compare it against floating-point training. We evaluate on the MNIST and Fashion MNIST datasets, where 16-bit posits outperform 16-bit floating point for end-to-end DNN training.

Index Terms: Deep neural networks, low-precision arithmetic, posit numerical format
I. INTRODUCTION

Edge computing offers a decentralized alternative to cloud-based data centers [1] and brings intelligence to the edge of mobile networks. However, training deep neural networks (DNNs) on the edge is challenging, largely due to the significant cost of multiply-and-accumulate (MAC) units, a ubiquitous operation in all DNNs. In a 45 nm CMOS process, energy consumption roughly doubles from 16-bit to 32-bit floating-point addition and grows by ∼4× for multiplication [2]. Memory access cost increases by ∼10× when moving from an 8 kB to a 1 MB memory for a 64-bit cache access [2]. In general, there is a gap between the memory storage, bandwidth, compute requirements, and energy consumption of modern DNNs and the hardware resources available on edge devices [3].

An apparent solution to this gap is to compress such networks, reducing their compute requirements to match putative edge resources. Several groups have proposed new compute- and memory-efficient DNN architectures [4]-[6] and parameter-efficient neural networks, using methods such as DNN pruning [7], distillation [8], and low-precision arithmetic [9], [10]. Among these approaches, low-precision arithmetic is notable for reducing the memory capacity, bandwidth, latency, and energy consumption associated with MAC units in DNNs while increasing the level of data parallelism [9], [11], [12]. The ultimate goal of low-precision DNN design is to reduce the hardware complexity of a high-precision DNN model to a level suitable for edge devices without significantly degrading performance.

To address the gaps in previous studies, we investigate low-precision posit arithmetic for DNN training on the edge.
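To make the MAC cost concrete, the following minimal Python/NumPy sketch (illustrative only, not the implementation used in this work or the cited references) shows a dot product as a chain of MAC operations and the memory saving from storing operands in 16-bit rather than 32-bit floats; keeping a wider float32 accumulator is one common way to limit rounding error when operands are low precision.

    import numpy as np

    # Illustrative sketch: a dot product is a chain of multiply-and-accumulate
    # (MAC) operations, the dominant compute pattern in DNN layers.
    def mac_dot(weights, activations):
        acc = np.float32(0.0)                      # wider accumulator than the operands
        for w, a in zip(weights, activations):
            acc += np.float32(w) * np.float32(a)   # one MAC per element
        return acc

    w = np.random.randn(1024).astype(np.float16)   # 16-bit storage: 2048 bytes
    a = np.random.randn(1024).astype(np.float16)   # vs. 4096 bytes at 32-bit
    print(mac_dot(w, a), w.nbytes, "bytes per operand vector")

Halving the operand width halves the storage and bandwidth per MAC, which is the source of the memory and energy savings discussed above; posits aim to deliver such savings with less accuracy loss than fixed-width floats.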
II. POSIT NUMERICAL FORMAT

An alternative to IEEE-754 floating point numbers, posits were recently introduced and exhibit a tapered-precision char-