Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Cao, Shi-Jie; Zhang, Chen; Yao, Zhuliang; Xiao, Wencong; Nie, Lei; Zhan, Dechen; Liu, Yunxin; Wu, Ming; Zhang, Lintao

doi:10.1145/3289602.3293898

Cited by 152 publications

(116 citation statements)

References 17 publications

Supporting

Mentioning: 115

Contrasting


“…Many accelerators targeting FPGAs have taken advantage of sparsity for Fully Connected (FC) or Long Short Term Memory (LSTM) units [3,4,5,6]. But fewer have taken advantage of sparsity in convolutional layers.…”

Section: Related Work (mentioning)

confidence: 99%

HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

Hall; Betz. 2020. Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse Resnet-50 and dense MobileNet Imagenet classifiers on a Stratix 10 2800 FPGA. We find that the sparse Resnet-50 model has throughput at a batch size of 1 of 4550 images/s, which is nearly 4x the throughput of NVIDIA's fastest machine learning targeted GPU, the V100, and outperforms all prior work on FPGAs.



“…One of the most popular approaches to obtain more energy efficient inference for neural networks is through custom hardware accelerators, targeting field-programmable gate arrays (FPGAs) [15,19,39] or application-specific integrated circuit (ASICs) [3,6,28,40]. These are custom-built architectures that optimize the most energy-intensive operations involved in the inference process (typically multiply-and-accumulate loops).…”

Section: Custom Hardware Designs (mentioning)

confidence: 99%

“…These are custom-built architectures that optimize the most energy-intensive operations involved in the inference process (typically multiply-and-accumulate loops). The majority of custom accelerator designs have been proposed for convolutional neural networks (CNNs), particularly for image processing, while fewer works have targeted sequence to sequence architectures [18,19,41], such as the ones considered in this work. While hardware accelerators are able to improve the energy efficiency by several orders of magnitude, they are mostly suitable for high-end applications, for which a heterogeneous systems-on-chip with dedicated hardware blocks for a specific functionality can be afforded.…”

Section: Custom Hardware Designs (mentioning)

confidence: 99%

“…While for most applications training is a one-time task, and can therefore be performed in the cloud, there is a growing demand for executing NN inference on embedded systems (so-called "edge" nodes), in order to enhance the features of many Internet of Things (IoT) applications [3]. In fact, edge inference could yield benefits in terms of data privacy, response latency and energy efficiency, as it would eliminate the need of transmitting high volumes of raw data to the cloud [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19].…”

Section: Introduction (mentioning)

confidence: 99%

“…One of the most popular approaches is to design custom hardware accelerators to implement the most critical operations involved in the inference phase, which are typically multiplications of large matrices and vectors, in a fast and efficient way. Most accelerators have been designed for convolutional neural networks (CNNs), due to their outstanding results in computer vision applications [3,6,7,12,16], but more recently, hardware acceleration of sequence-to-sequence models, such as RNNs and transformers, has also been investigated extensively [10,15,17,18,19].…”

Section: Introduction (mentioning)

confidence: 99%


Sequence-To-Sequence Neural Networks Inference on Embedded Processors Using Dynamic Beam Search

2020


Sequence-to-sequence deep neural networks have become the state of the art for a variety of machine learning applications, ranging from neural machine translation (NMT) to speech recognition. Many mobile and Internet of Things (IoT) applications would benefit from the ability of performing sequence-to-sequence inference directly in embedded devices, thereby reducing the amount of raw data transmitted to the cloud, and obtaining benefits in terms of response latency, energy consumption and security. However, due to the high computational complexity of these models, specific optimization techniques are needed to achieve acceptable performance and energy consumption on single-core embedded processors. In this paper, we present a new optimization technique called dynamic beam search, in which the inference complexity is tuned to the difficulty of the processed input sequence at runtime. Results based on measurements on a real embedded device, and on three state-of-the-art deep learning models, show that our method is able to reduce the inference time and energy by up to 25% without loss of accuracy.


Ramanujan bipartite graph products for efficient block sparse neural networks

Vooturi; Varma; Kothapalli. 2021. Concurrency and Computation


Sparse neural networks are shown to give accurate predictions competitive with denser versions, while also minimizing the number of arithmetic operations performed. However, current GPU hardware can only exploit structured sparsity patterns for better efficiency. We propose a framework for generating structured multilevel block sparse neural networks by using the theory of graph products. Our Ramanujan bipartite graph product (RBGP) framework uses products of Ramanujan graphs to obtain the best connectivity for a given level of sparsity. This essentially ensures that (i) the network has the structured block sparsity for which runtime-efficient algorithms exist, (ii) the model gives high prediction accuracy, due to the better expressive power derived from the connectivity of the graph, and (iii) the graph data structure has a succinct representation that can be stored efficiently in memory. We use our framework to design a specific connectivity pattern called RBGP4, which makes efficient use of the memory hierarchy available on GPUs. We benchmark our approach on image classification and machine translation tasks on an edge GPU (Jetson Nano 2GB) as well as a server GPU (V100). When compared with commonly used sparsity patterns such as unstructured and block sparsity, we obtain significant speedups while achieving the same level of accuracy.

