2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca53966.2022.00041
GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource-constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fix…

Cited by 40 publications (8 citation statements)
References 53 publications
“…The authors identify the characteristics of Transformer-based models and propose various optimization methods. ViTCoD [34] designs a dedicated accelerator for sparse and dense workloads to boost hardware utilization for vision transformers. Auto-ViT-Acc [36] designs an FPGA accelerator for multi-head attention and an FPGA-aware quantization algorithm to make better use of FPGA resources.…”
Section: Sequential Accelerators (mentioning)
Confidence: 99%
“…To take full advantage of this reduction without introducing significant overheads, OuterSPACE builds a custom accelerator with reconfigurable memory hierarchy and achieves a mean speedup of 7.9× over the CPU running Intel Math Kernel Library and 14.0× against the GPU running CUSP. Furthermore, to alleviate the data movement bottleneck caused by high sparsity, ViTCoD [86] uses a learnable auto-encoder to compress the sparse attentions to a much more compact representation and designs encoder and decoder engines to boost the hardware utilization.…”
Section: Memory Efficiency (mentioning)
Confidence: 99%
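The statement above describes compressing a sparse attention map into a compact code before moving it through memory. As a rough, hypothetical illustration of that idea (not ViTCoD's actual learned auto-encoder), the sketch below stands in a truncated SVD for the trained encoder/decoder pair, so the "code" is a low-rank projection of a mostly-zero attention map:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sparse 64x64 attention map: roughly 10% of entries are nonzero.
attn = rng.random((64, 64))
attn[attn < 0.9] = 0.0

# "Train" a rank-8 encoder/decoder via truncated SVD. In ViTCoD this role
# is played by a learnable auto-encoder; the SVD here is an assumption
# used only to keep the sketch self-contained.
U, _, _ = np.linalg.svd(attn, full_matrices=False)
k = 8

def encode(a):
    # 64x64 attention map -> 8x64 compact code (8x fewer values to move).
    return U[:, :k].T @ a

def decode(z):
    # 8x64 compact code -> approximate 64x64 attention map.
    return U[:, :k] @ z

code = encode(attn)
recon = decode(code)

compression = attn.size / code.size                      # 8.0
err = np.linalg.norm(attn - recon) / np.linalg.norm(attn)  # lossy but bounded
```

The point of the compact representation is that the accelerator's memory traffic scales with `code.size` rather than `attn.size`; the decoder engine reconstructs the map on-chip.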
“…There are also some previous studies on algorithm and hardware co-design for GNNs [36,46,47]. [47] first presents a framework that automatically co-searches the GNN and its accelerator to maximize both task accuracy and acceleration efficiency.…”
Section: Related Work (mentioning)
Confidence: 99%
“…[36] proposes a model-architecture co-design with a lightweight algorithm for temporal GNN inference on FPGAs. [46] proposes GCoD, a GCN algorithm and accelerator co-design framework, involving a two-pronged accelerator with separate engines to process dense and sparse workloads. Some previous studies focus on accelerating GNN training [48,49,10,50].…”
Section: Related Work (mentioning)
Confidence: 99%
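The "two-pronged" idea above routes dense and sparse portions of the workload to different engines. A minimal sketch of that partitioning, with an assumed density threshold and toy engines (not GCoD's actual design), might partition adjacency rows by degree and check that the two paths agree with a single full matmul:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 16-node adjacency: mostly sparse rows, with a few dense ones.
adj = (rng.random((16, 16)) < 0.2).astype(float)
adj[:4] = (rng.random((4, 16)) < 0.8).astype(float)

# Partition rows by density; the 0.5 threshold is an illustrative assumption.
density = adj.sum(axis=1) / adj.shape[1]
dense_rows = np.where(density >= 0.5)[0]
sparse_rows = np.where(density < 0.5)[0]

x = rng.random((16, 8))  # node features

# "Dense engine": plain matmul over the dense partition.
out_dense = adj[dense_rows] @ x

# "Sparse engine": accumulate over nonzeros only, skipping zero entries.
out_sparse = np.zeros((len(sparse_rows), x.shape[1]))
for i, r in enumerate(sparse_rows):
    for c in np.nonzero(adj[r])[0]:
        out_sparse[i] += adj[r, c] * x[c]

# Both partitions together reproduce the monolithic computation.
full = adj @ x
```

Splitting this way lets each engine stay efficient on its own regime: the dense path keeps high compute-unit utilization, while the sparse path avoids wasting cycles on zeros.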