SparCE: Sparsity Aware General-Purpose Core Extensions to Accelerate Deep Neural Networks

Sen, Sanchari; Jain, Shubham; Venkataramani, Swagath; Raghunathan, Anand

doi:10.1109/tc.2018.2879434

Cited by 29 publications

(25 citation statements)

References 32 publications


“…The AFTx06 implements an RV32IM instruction set, integer multiplication and division, an integrated core local interrupt (CLINT) controller and a platform-level interrupt controller (PLIC), and it is expected to operate with a 100-MHz clock. A key feature of this microprocessor is the sparsity-aware core extensions (SparCE) architecture [2], a component that exploits sparsity in convolution arithmetic, allowing extraneous instructions (such as multiplication by zero; common in rectified linear unit functions) to be skipped at run time. The architecture has been designed to improve both the speed and power consumption of this common machine learning calculation.…”

Section: Sparsity-optimized RISC-V SoC: Combining Undergraduate Education and SoC Research (mentioning)

confidence: 99%

Democratizing IC Design: The Story of a New Movement and the Launch of the SSCS PICO Program [Society News]

2021

IEEE Solid-State Circuits Mag.
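The zero-skipping idea in the citation statement above can be illustrated in software. The following is a minimal sketch, not the SparCE hardware mechanism itself: it models a dot product in which any multiply-accumulate with a zero activation is skipped and counted. The names `sparse_aware_dot` and `relu` are hypothetical and chosen only for this example.

```python
def relu(values):
    """Rectified linear unit: negative inputs become exactly zero."""
    return [v if v > 0 else 0.0 for v in values]


def sparse_aware_dot(activations, weights):
    """Dot product that skips multiply-accumulates whose activation is zero.

    Returns (result, skipped), where skipped counts the avoided operations.
    This is a software model only; real hardware would make the skip
    decision without executing a branch per element.
    """
    if len(activations) != len(weights):
        raise ValueError("operands must have the same length")
    total = 0.0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0:
            # A zero operand cannot change the sum, so the multiply is redundant.
            skipped += 1
            continue
        total += a * w
    return total, skipped


# ReLU zeroes the two negative inputs, so two of four multiplies are skipped.
acts = relu([-1.5, 2.0, -0.25, 0.5])
result, skipped = sparse_aware_dot(acts, [4.0, 3.0, 2.0, 1.0])
print(result, skipped)
```

In this model the check costs a branch per element, which is why the quoted work moves the test into hardware rather than leaving it in software.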


“…SparCE [40] skips ineffectual code blocks based on a sparse input. It annotates skippable code blocks in software and tests conditions in hardware.…”

Section: Related Work (mentioning)

confidence: 99%

“…Indeed, prior efforts spanning hardware to software and algorithms have exploited sparsity to eliminate computation or data transfers at different points in DNN computations. Most of these efforts, though, require hardware changes [3,7,13,34,38,40,57] and/or apply only to inference [3,7,13,15,34,35,50,53,57]. This is not ideal, since most of real-world DNN computations are performed on conventional CPUs and GPUs [4,16,33,51], and significant time goes into training.…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, operating on sparse data incurs overhead: modern machines are highly optimized for dense computations, and suffer from the extra indirections and branches that appear when processing sparse data. Prior work either relies on custom hardware to minimize these overheads [3,7,13,34,40,57], or sophisticated pre-processing to "shape" the sparsity pattern to better match existing hardware [35,50,53]-which only applies to static sparsity.…”

Section: Introduction (mentioning)

confidence: 99%


SparseTrain

Gong, Ji, Fletcher et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques


Our community has improved the efficiency of deep learning applications by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. In this paper, we propose SparseTrain, a software-only scheme to leverage dynamic sparsity during training on general-purpose SIMD processors. SparseTrain exploits zeros introduced by the ReLU activation function to both feature maps and their gradients. Exploiting such sparsity is challenging because the sparsity degree is moderate and the locations of zeros change over time. SparseTrain identifies zeros in a dense data representation and performs vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our experiments on a 6-core Intel Skylake-X server show that SparseTrain is very effective. In end-to-end training of VGG16, ResNet-34, and ResNet-50 with ImageNet, SparseTrain outperforms a highly-optimized direct convolution on the non-initial convolutional layers by 2.19x, 1.37x, and 1.31x, respectively. SparseTrain also benefits inference. It accelerates the non-initial convolutional layers of the aforementioned models by 1.88x, 1.64x, and 1.44x, respectively.

CCS Concepts: Computing methodologies → Neural networks; Shared memory algorithms; Vector / streaming algorithms.
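The abstract's core idea, finding zeros in a dense representation and skipping the vectorized work they would trigger, can be sketched as follows. This is an illustrative model only and not the SparseTrain implementation: it uses a matrix-vector product rather than a convolution, plain Python lists rather than SIMD registers, and the hypothetical name `dense_matvec_skip_zeros`.

```python
def dense_matvec_skip_zeros(x, W):
    """Compute y = x @ W, skipping every row of W whose input element is zero.

    x is a dense list of length n (zeros stored explicitly, no index arrays);
    W is a list of n rows of equal length m. The inner loop over j stands in
    for one vectorized multiply-add across a whole row.
    Returns (y, rows_used), where rows_used counts the rows actually processed.
    """
    if len(x) != len(W):
        raise ValueError("x and W must agree on the number of rows")
    m = len(W[0]) if W else 0
    y = [0.0] * m
    rows_used = 0
    for xi, row in zip(x, W):
        if xi == 0:
            # One scalar test removes an entire row's worth of vector work.
            continue
        rows_used += 1
        for j in range(m):
            y[j] += xi * row[j]
    return y, rows_used


# Half of the inputs are zero (as after ReLU), so only two rows are touched.
x = [0.0, 2.0, 0.0, 1.0]
W = [[1.0, 1.0], [0.5, 2.0], [3.0, 3.0], [4.0, -1.0]]
y, rows_used = dense_matvec_skip_zeros(x, W)
print(y, rows_used)
```

Because the input stays dense, the zero locations can change on every call without any re-encoding step, which matches the dynamic sparsity the abstract describes.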


“…Finally, we compare our work on exploiting zeros in modern gaming applications with prior art on leveraging sparsity (i.e., memory loads and computations returning zeros) in deep neural networks (DNNs). Several researchers have exploited sparsity to improve DNN performance in accelerator architectures [1,34,50,65], GPUs [51], and general-purpose processors [56] through hardware enhancements. Like these efforts, Zeroploit seeks to improve performance by leveraging zero valued operands.…”

Section: Related Work (mentioning)

confidence: 99%

Zeroploit

Rangan, Stephenson, Ukarande et al. 2020

ACM Trans. Archit. Code Optim.


In this article, we first characterize register operand value locality in shader programs of modern gaming applications and observe that there is a high likelihood of one of the register operands of several multiply, logical-and, and similar operations being zero, dynamically. We provide intuition, examples, and a quantitative characterization for how zeros originate dynamically in these programs. Next, we show that this dynamic behavior can be gainfully exploited with a profile-guided code optimization called Zeroploit that transforms targeted code regions into a zero-(value-)specialized fast path and a default slow path. The fast path benefits from zero-specialization in two ways, namely: (a) the backward slice of the other operand of a given multiply or logical-and can be skipped dynamically, provided the only use of that other operand is in the given instruction, and (b) the forward slice of instructions originating at the given instruction can be zero-specialized, potentially triggering further backward slice specializations from operations of that forward slice as well. Such specialization helps the fast path avoid redundant dynamic computations as well as memory fetches, while the fast-slow versioning transform helps preserve functional correctness. With an offline value profiler and manually optimized shader programs, we demonstrate that Zeroploit is able to achieve an average speedup of 35.8% for targeted shader programs, amounting to an average frame-rate speedup of 2.8% across a collection of modern gaming applications on an NVIDIA® GeForce RTX™ 2080 GPU.
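The fast-path and slow-path versioning described in the abstract can be sketched in a few lines. This is a hand-written illustration under stated assumptions, not the Zeroploit transform: the names `make_counted` and `scaled_term` are hypothetical, the operands are integers, and `costly` is assumed to be side-effect free and used only by the one multiply.

```python
def make_counted(fn):
    """Wrap fn so that the number of calls can be inspected afterwards."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper


def scaled_term(a, b, costly):
    """Return a * costly(b), versioned into a fast path and a slow path.

    Assumes integer operands and a side-effect-free costly(); with IEEE
    floats, 0 * inf or 0 * NaN would not be 0, so the fast path would
    not be safe without further checks.
    """
    if a == 0:
        # Fast path: the product is zero, so the backward slice that
        # computes the other operand (costly(b)) is skipped entirely.
        return 0
    # Slow path: the original, unspecialized computation.
    return a * costly(b)


costly = make_counted(lambda b: sum(i * b for i in range(1000)))
values = [scaled_term(a, 3, costly) for a in (0, 2, 0, 1)]
print(values, costly.calls)
```

Keeping the slow path unchanged is what preserves correctness: the fast path is taken only when the zero test succeeds at run time.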

