GRNN

Holmes, Connor; Mawhirter, Daniel; He, Yuxiong; Yan, Feng; Wu, Bo

doi:10.1145/3302424.3303949

Cited by 36 publications

(12 citation statements)

References 19 publications


“…We compare SHARP against the state-of-the-art GPU, FPGA and ASIC implementations, i.e. cuDNN [20], GRNN [23],…”

Section: Results (mentioning)

confidence: 99%

“…To meet the requirements of real-time inference at large scale, a high-performance and energy efficient accelerator for RNN is highly desired. However, two reasons make it very difficult to accomplish efficient RNN computation by CPUs or GPUs in parallel [21,22]: (1) recurrent behaviour of RNN architecture which imposes several data-dependencies, (2) limited parallel tasks due to the enforced low batch size by Service-Level Agreements (SLAs) in the inference evaluation [23,24].…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, for the online inference scenario, queries come in one-by-one and have stringent latency SLA, often in single milliseconds [23,24]. This requirement further reduces data reuse and available parallelism in RNN inference.…”

Section: Introduction (mentioning)

confidence: 99%

“…This requirement further reduces data reuse and available parallelism in RNN inference. Recently, there have been several efforts on either CPU [26], or GPU [23,22], to improve the efficiency of RNN inference. However, they show poor scalability for either small or large models with different sequence length.…”

Section: Introduction (mentioning)

confidence: 99%


SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Networks

Yazdani, Ruwase, Zhang, et al. 2023. ACM Trans. Embed. Comput. Syst.

Self Cite


The effectiveness of Recurrent Neural Networks (RNNs) for tasks such as Automatic Speech Recognition has fostered interest in RNN inference acceleration. Due to the recurrent nature and data dependencies of RNN computations, prior work has designed customized architectures specifically tailored to the computation pattern of RNNs, achieving high computational efficiency for certain chosen model sizes. However, given that the dimensionality of RNNs varies widely across tasks, it is crucial to generalize this efficiency to diverse configurations. In this work, we identify adaptiveness as a key feature that is missing from today’s RNN accelerators. In particular, we first show the problem of low resource utilization and low adaptiveness for the state-of-the-art RNN implementations on GPU, FPGA and ASIC architectures. To solve these issues, we propose an intelligent tile-based dispatching mechanism for increasing the adaptiveness of RNN computation, in order to efficiently handle the data dependencies. To do so, we propose SHARP as a hardware accelerator, which pipelines RNN computation using an effective scheduling scheme to hide most of the dependent serialization. Furthermore, SHARP employs a dynamically reconfigurable architecture to adapt to the model’s characteristics. SHARP achieves 2x, 2.8x, and 82x speedups on average, considering different RNN models and resource budgets, compared to the state-of-the-art ASIC, FPGA, and GPU implementations, respectively. Furthermore, we provide significant energy reduction with respect to the previous solutions, due to the low power dissipation of SHARP (321 GFLOPS/Watt).


“…While CCA provides strong data security and enables confidential computing on next-generation Arm devices, the support for GPUs [4], [21], [69], which are widely used to accelerate the general-, high-performance, and artificial intelligence computing scenarios [15], [18], [34], [45], [55], is only recently proposed. However, such support, called RME Device Assignment (RME-DA) [25], is currently a high-level concept without completed hardware implementation.…”

Section: Introduction (mentioning)

confidence: 99%

CAGE: Complementing Arm CCA with GPU Extensions

Zhang, Deng, et al. 2024. Proceedings 2024 Network and Distributed System Security Symposium


Confidential computing is an emerging technique that provides users and third-party developers with an isolated and transparent execution environment. To support this technique, Arm introduced the Confidential Computing Architecture (CCA), which creates multiple isolated address spaces, known as realms, to ensure data confidentiality and integrity in security-sensitive tasks. Arm recently proposed the concept of confidential computing on GPU hardware, which is widely used in general-purpose, high-performance, and artificial intelligence computing scenarios. However, hardware and firmware supporting confidential GPU workloads remain unavailable. Existing studies leverage Trusted Execution Environments (TEEs) to secure GPU computing on Arm- or Intel-based platforms, but they are not suitable for CCA's realm-style architecture, such as using incompatible hardware or introducing a large trusted computing base (TCB). Therefore, there is a need to complement existing Arm CCA capabilities with GPU acceleration. To address this challenge, we present CAGE to support confidential GPU computing for Arm CCA. By leveraging the existing security features in Arm CCA, CAGE ensures data security during confidential computing on unified-memory GPUs, the mainstream accelerators in Arm devices. To adapt the GPU workflow to CCA's realm-style architecture, CAGE proposes a novel shadow task mechanism to manage confidential GPU applications flexibly. Additionally, CAGE leverages the memory isolation mechanism in Arm CCA to protect data confidentiality and integrity from the strong adversary. Based on this, CAGE also optimizes security operations in memory isolation to mitigate performance overhead. Without hardware changes, our approach uses the generic hardware security primitives in Arm CCA to defend against a privileged adversary. We present two prototypes to verify CAGE's functionality and evaluate performance, respectively. Results show that CAGE effectively provides GPU support for Arm CCA with an average of 2.45% performance overhead.


Fast and accurate modeling of transient‐state, gradient‐spoiled sequences by recurrent neural networks

et al. 2021


Funding information: China Scholarship Council.

Fast and accurate modeling of MR signal responses is typically required for various quantitative MRI applications, such as MR fingerprinting. This work uses a new extended phase graph (EPG)-Bloch model for accurate simulation of transient-state, gradient-spoiled MR sequences, and proposes a recurrent neural network (RNN) as a fast surrogate of the EPG-Bloch model for computing large-scale MR signals and derivatives. The computational efficiency of the RNN model is demonstrated by comparisons with other existing models, showing one to three orders of magnitude of acceleration compared with the latest GPU-accelerated, open-source EPG package. Using numerical and in vivo brain data, two use cases, namely MRF dictionary generation and optimal experimental design, are also provided. Results show that the RNN surrogate model can be efficiently used for computing large-scale dictionaries of transient-state signals and derivatives within tens of seconds, resulting in several orders of magnitude acceleration with respect to state-of-the-art implementations. The practical application of transient-state quantitative techniques can therefore be substantially facilitated.
