Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture 2017
DOI: 10.1145/3123939.3123982

Bit-pragmatic deep neural network computing

Abstract: We quantify a source of ineffectual computations when processing the multiplications of the convolutional layers in Deep Neural Networks (DNNs) and propose Pragmatic (PRA), an architecture that exploits it to improve performance and energy efficiency. The source of these ineffectual computations is best understood in the context of conventional multipliers, which internally generate multiple terms, that is, products of the multiplicand and powers of two, which added together produce the final product [1…
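The decomposition mentioned in the abstract can be illustrated in a few lines of Python. This is only a sketch of the idea, not the paper's hardware design, and the function names are invented for illustration: a multiplication breaks into one term per activation bit, and only the nonzero ("essential") bits actually contribute to the product.

def essential_bit_offsets(activation: int):
    # Bit positions (powers of two) of the nonzero bits of a non-negative activation.
    offsets, pos = [], 0
    while activation:
        if activation & 1:
            offsets.append(pos)
        activation >>= 1
        pos += 1
    return offsets

def shift_and_add_multiply(weight: int, activation: int) -> int:
    # Accumulate one shifted copy of the weight per essential activation bit;
    # the zero-bit terms a conventional multiplier still generates are skipped.
    return sum(weight << k for k in essential_bit_offsets(activation))

# 0b00101100 has only three essential bits, so only three terms are accumulated.
assert shift_and_add_multiply(7, 0b00101100) == 7 * 0b00101100

An activation with few essential bits therefore needs proportionally few terms, which is the source of the performance and energy gains the abstract refers to.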

Cited by 203 publications (87 citation statements)
References 24 publications

“…We use two baselines: the first is DaDianNao [9], the de facto design against which the relative performance of diverse DCNN accelerators is reported; the second is a state-of-the-art bit-serial implementation [5] (PRA), which also computes only the essential bits of the activations, and whose fp16 design we adopt for the weights for a fair comparison. We implement Tetris with two configurable modes, fp16 and int8, as mentioned in Section III.…”
Section: Discussion (mentioning)
confidence: 99%
“…This will significantly speed up the convolution operations [16]. (ii) A network quantized to fixed-point requires specialized integer arithmetic units (with various bitwidths) for efficient computation [1,18], whereas a network quantized with multiple binary bases uses the same operations as the binary networks mentioned above. Popular networks quantized with binary bases include Binary Networks and Multi-bit Networks.…”
Section: Related Work (mentioning)
confidence: 99%
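The statement above contrasts fixed-point quantization with quantization onto multiple binary bases. The sketch below is only an illustration of the generic scheme, not the exact method of the cited works: a vector is greedily decomposed onto a few +1/-1 bases, after which a dot product reduces to scaled binary dot products, the same operation binary networks use (an XNOR/popcount in hardware).

import numpy as np

def multibit_binary_quantize(x, num_bases=2):
    # Greedy residual fit: x is approximated by sum_i alpha_i * b_i with b_i in {-1, +1}.
    residual = np.asarray(x, dtype=np.float64).copy()
    alphas, bases = [], []
    for _ in range(num_bases):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = float(np.mean(np.abs(residual)))
        alphas.append(alpha)
        bases.append(b)
        residual -= alpha * b
    return alphas, bases

def multibit_dot(xq, wq):
    # Every cross term is a dot product of two +1/-1 vectors, scaled by the
    # product of the corresponding scaling factors.
    ax, bx = xq
    aw, bw = wq
    return sum(a1 * a2 * float(np.dot(b1, b2))
               for a1, b1 in zip(ax, bx)
               for a2, b2 in zip(aw, bw))

x, w = np.random.randn(64), np.random.randn(64)
approx = multibit_dot(multibit_binary_quantize(x), multibit_binary_quantize(w))
print(approx, float(np.dot(x, w)))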
“…2) Simulation: Some prior work has also simulated machine learning workloads, but these papers used private simulators [33]-[36]. Since these simulators are not publicly available and few details have been published, it is difficult to compare their approaches to ours.…”
Section: A Machine Learning Framework (mentioning)
confidence: 99%