2022
DOI: 10.1145/3499757
E-BATCH: Energy-Efficient and High-Throughput RNN Batching

Abstract: Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may vastly differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short time span, decreasing energy efficiency. Hence, we propose E-BATCH, a l…
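To make the padding problem concrete, here is a minimal sketch (not from the paper; the sequence lengths are illustrative assumptions) comparing the wasted work of a statically padded batch against a scheme that re-forms the batch at every time-step:

```python
# Illustrative sketch: wasted work from padding when batching
# variable-length RNN requests. Sequence lengths are hypothetical.

seq_lens = [12, 87, 35, 60]  # lengths of four batched requests (assumed)

# Static batching: every sequence is padded to the longest one, so each
# time-step processes the full batch even after short requests finish.
max_len = max(seq_lens)
static_work = max_len * len(seq_lens)   # time-steps x batch slots
useful_work = sum(seq_lens)             # time-steps that carry real data
print(f"padding overhead: {1 - useful_work / static_work:.0%}")  # ~44%

# Dynamic batching: the batch is re-formed each time-step, so finished
# requests drop out and no padded slots are computed.
dynamic_work = sum(1 for t in range(max_len) for l in seq_lens if t < l)
assert dynamic_work == useful_work  # no wasted slots
```

The trade-off the abstract points to is that re-forming the batch this often forces different RNN layers to execute in a short time span, which is what costs energy efficiency.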

Cited by 7 publications (23 citation statements). References 24 publications.
“…As depicted, each LSTM gate performs two matrix-vector multiplications (MVMs), which ultimately decide how to update the cell state (c_t) and how to generate the hidden output vector (h_t) that is recurrently fed to the following time step. Two kinds of dependencies exist in these computations. The figure shows E-PUR's [21] speedup running EESEN [8] for a range of MAC units. Due to the scalability issue, it does not achieve a performance improvement proportional to the increase in resources.…”
Section: RNN Background (mentioning, confidence: 99%)
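For reference, the cell update the statement describes follows the standard LSTM formulation (not quoted from the citing paper); each gate applies one MVM to the input x_t and one to the recurrent state h_{t-1}, and it is the dependence on h_{t-1} and c_{t-1} that serializes the time-steps:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```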
“…For instance, NPUs [24,28] have the parallel multiply-accumulate (MAC) stage at the heart of their pipeline and are not optimized for cases where the serial part becomes the performance bottleneck for some models. On the other hand, customized accelerators [21,29] use a relatively small resource budget, which causes a large delay for MVMs; they therefore overlap the remaining LSTM computation that must run sequentially. However, when using more MACs, the issue of efficiently handling the LSTM's dependencies still remains.…”
Section: Challenges and Opportunities (mentioning, confidence: 99%)
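The scaling behavior described above is essentially Amdahl's law across time-steps: adding MAC units shrinks only the MVM portion of each step, while the element-wise gate operations and the recurrent dependency stay serial. A rough model of this, where all cycle counts are illustrative assumptions rather than measurements from E-PUR or EESEN:

```python
# Rough Amdahl-style model of LSTM step latency on an accelerator.
# Cycle counts are illustrative assumptions, not measured values.

MVM_WORK = 4 * 2 * 1024 * 1024  # MACs per step: 4 gates x 2 MVMs x 1024x1024
SERIAL_CYCLES = 4096            # element-wise gate ops that stay sequential

for macs in (256, 1024, 4096, 16384):
    step_cycles = MVM_WORK / macs + SERIAL_CYCLES
    print(f"{macs:6d} MACs -> {step_cycles:10.0f} cycles/step")
# Speedup flattens: once MVM_WORK / macs approaches SERIAL_CYCLES, extra
# MAC units no longer help, matching the citing paper's observation.
```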
“…Meanwhile, based on the learning level set of the mean simulator response, several new schemes were developed, including Multi-Level Batching (MLB), Ratchet Batching (RB), Adaptive Batched Stepwise Uncertainty Reduction (ABSUR), Adaptive Design with Stepwise Allocation (ADSA), and Deterministic Design with Stepwise Allocation (DDSA). MLB, RB, and ABSUR determine the sequential design inputs and the respective number of replicates simultaneously, whereas ADSA and DDSA do so sequentially [3]. In quantitative applications to many financial instances, such as Bermudan option pricing via Monte Carlo regression, the method showed significant computational speed and a low distortion rate.…”
Section: Introduction (mentioning, confidence: 99%)