Transformers are rapidly emerging as one of the most important primitives in neural networks. Unfortunately, most existing hardware designs for transformers fall short, either giving little consideration to configurability or failing to support the complete inference process. In particular, few studies have addressed compatibility across different computing paradigms. This paper therefore presents EFA-Trans, a highly efficient and flexible hardware accelerator architecture for transformers. To achieve high performance, we propose a configurable matrix computing array and leverage on-chip memory optimizations. In addition, with dedicated nonlinear modules and fine-grained scheduling, our architecture performs complete transformer inference. EFA-Trans is also compatible with both dense and sparse computing patterns, which further broadens its application scenarios. Moreover, an analytical performance model is derived to guide the selection of architecture parameter sets. Finally, our design is implemented in RTL and evaluated on a Xilinx ZCU102 FPGA. Experimental results demonstrate that EFA-Trans provides 23.74× and 7.58× improvements in energy efficiency compared with a CPU and a GPU, respectively. Its DSP efficiency is also 3.59× to 21.07× higher than that of existing advanced designs.