DOTA: detect and omit weak attentions for scalable transformer acceleration

Qu, Zheng; Liu, Liu; Tu, Fengbin; Chen, Zhaodong; Ding, Yufei; Xie, Yuan

doi:10.1145/3503222.3507738

Cited by 38 publications

(20 citation statements)

References 37 publications

“…Our ViTCoD split and conquer algorithm exhibits a great potential in both reducing the dominate attention computations and alleviating the irregularity of the resulting sparse attention masks. However, this potential cannot be fully exploited by existing Transformer accelerators [21], [27], [39] due to the fact that (1) they are designed for dynamic sparse attention which requires both on-the-fly mask generation and highly reconfigurable architecture supports, both of which require nontrivial overheads, and (2) they are not dedicated for processing the enforced two distinct workloads, i.e., denser and sparser patterns, from our ViTCoD algorithm. As such, our ViTCoD accelerator is motivated to exploit the new opportunities i.e., fixed and structurally sparse patterns, resulting from ViTCoD algorithm to boost ViTs' inference efficiency.…”

Section: A Motivation Of Vitcod Accelerator (mentioning)

confidence: 99%

“…Baselines: To benchmark ViTCoD with SOTA attention accelerators, we consider a total of five baselines, including three general platforms: CPU (Intel Xeon Gold 6230R), EdgeGPU (Nvidia Jetson Xavier NX), and GPU (Nvidia 2080Ti), and two attention accelerators: SpAtten [39] and Sanger [21]. Note that when benchmarking with GPUs w/ larger batch size, we scale up the accelerators' hardware resource to have a comparable peak throughput for a fair comparison following [27]. Metrics: We evaluate all platforms in terms of latency speedups and energy efficiency.…”

Section: Experiments a Experiments Setting (mentioning)

confidence: 99%

“…SpAtten [39] structurally removes unnecessary attention heads and input tokens, which is therefore coarse-grained and leads to a low achievable sparsity ratio; Sanger [21] adopts low precision Q and K vectors for estimating the sparse attention masks, which are then packed and split to be more regular and friendly supported by a reconfigurable architecture. DOTA [27] considers both low precision and low rank linear transformation to predict the sparse attention masks, and explores token-level parallelism and out-of-order execution for locality-aware computing. All above works focus on NLP Transformers, and thus require dynamic and input-dependent sparse masks prediction.…”

Section: Related Work Vision Transformers (Vits) Motivated By Transfo... (mentioning)

confidence: 99%

“…accounts for over 50% of the total latency measured on a TITAN Xp GPU [39]; This percentage increases to 69% for LeViT-128 [9] when measured on an EdgeGPU [23]. To alleviate the bottleneck complexity of self-attention modules, sparse attention techniques have emerged as a promising solution and been considered by both algorithm [3], [36], [44] and hardware acceleration [11], [21], [27], [39] works. Despite their great promise, existing sparse attention accelerators or algorithm-accelerator co-design works (e.g., Sanger [21]) focus on accelerating NLP Transformers, and adopt hardware designs with on-the-fly sparse attention prediction and high reconfigurability in order to handle the varying number of input tokens in NLP.…”

Section: Introduction (mentioning)

confidence: 99%

GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

You, Geng, Zhang et al. 2022

2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, without severely hurting the model accuracy (e.g., <=1.5% under 90% pruning ratio); while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD's algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.

“…Then, the sparse attention matrix with reduced entries goes through the softmax operation after which it is multiplied by a dense value matrix. Many works in algorithm [2], [10], [12], [13], [48], [49] and hardware [21], [22], [28], [33], [40] have been proposed to implement such sparse attentions for NLP-based Transformer models by efficiently tackling various static and dynamic sparse patterns.…”

Section: Introduction (mentioning)

confidence: 99%
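
The flow described in the statement above (keep only selected attention entries, apply softmax over the kept entries, then multiply by a dense value matrix) can be illustrated with a minimal pure-Python sketch. The function name `sparse_attention`, the boolean `mask` layout, and the 1/sqrt(d) scaling are assumptions chosen for illustration; this is not code from any of the cited accelerators, and it makes no claim about how a mask is produced.

```python
import math

def sparse_attention(Q, K, V, mask):
    """Illustrative sparse attention (hypothetical helper, not from the cited papers).

    Q, K: lists of n vectors of length d; V: list of n vectors of length dv.
    mask[i][j] is True when query i is allowed to attend to key j.
    """
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Only the kept (query, key) pairs are scored; pruned pairs are skipped.
        kept = [j for j in range(len(K)) if mask[i][j]]
        if not kept:
            # A fully pruned row has nothing to attend to; return zeros.
            out.append([0.0] * dv)
            continue
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in kept]
        # Numerically stable softmax restricted to the kept entries.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Multiply the sparse attention row by the dense value matrix.
        out.append([sum(w * V[j][c] for w, j in zip(weights, kept)) for c in range(dv)])
    return out
```

With a fixed mask the list `kept` is known ahead of time for every row, whereas a dynamic scheme would have to compute `mask` per input before this step.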

ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention

Dass, Wang, Shi et al. 2022

Preprint

Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications. Specifically, ViTs' multi-head attention layers make it possible to embed information globally across the overall image. Nevertheless, computing and storing such attention matrices incurs a quadratic cost dependency on the number of patches, limiting its achievable efficiency and scalability and prohibiting more extensive real-world ViT applications on resource-constrained devices. Sparse attention has been shown to be a promising direction for improving hardware acceleration efficiency for NLP models. However, a systematic counterpart approach is still missing for accelerating ViT models. To close the above gap, we propose a first-of-its-kind algorithm-hardware codesigned framework, dubbed VITALITY, for boosting the inference efficiency of ViTs. Unlike sparsity-based Transformer accelerators for NLP, VITALITY unifies both low-rank and sparse components of the attention in ViTs. At the algorithm level, we approximate the dot-product softmax operation via first-order Taylor attention with row-mean centering as the low-rank component to linearize the cost of attention blocks and further boost the accuracy by incorporating a sparsity-based regularization. At the hardware level, we develop a dedicated accelerator to better leverage the resulting workload and pipeline from VITALITY's linear Taylor attention which requires the execution of only the low-rank component, to further boost the hardware efficiency. Extensive experiments and ablation studies validate that VITALITY offers boosted end-to-end efficiency (e.g., 3× faster and 3× energy-efficient) under comparable accuracy, with respect to the state-of-the-art solution.
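
To make the idea of a linear-cost first-order Taylor attention concrete, here is a rough pure-Python sketch. It assumes that exp(q·k) is replaced by 1 + q·k and that "row-mean centering" means subtracting the mean key across tokens, so the normaliser collapses to the token count n. Both points are an interpretation of the abstract, not the authors' confirmed formulation, and the name `taylor_attention` is hypothetical.

```python
import math

def taylor_attention(Q, K, V):
    """Sketch of first-order Taylor attention (assumed form, not the paper's code).

    Approximates softmax weights by (1 + q.k / sqrt(d)) / n after centring the keys,
    which lets the key/value summaries be computed once and reused for all queries.
    """
    n, d, dv = len(K), len(K[0]), len(V[0])
    scale = 1.0 / math.sqrt(d)
    # Centre keys so they sum to zero across tokens (assumed meaning of the centering).
    mean_k = [sum(k[a] for k in K) / n for a in range(d)]
    Kc = [[k[a] - mean_k[a] for a in range(d)] for k in K]
    # Global summaries, computed once: sum_j v_j and G = sum_j k_j v_j^T (d x dv).
    v_sum = [sum(v[c] for v in V) for c in range(dv)]
    G = [[sum(Kc[j][a] * V[j][c] for j in range(n)) for c in range(dv)]
         for a in range(d)]
    out = []
    for q in Q:
        # sum_j (1 + scale * q.k_j) v_j = v_sum + scale * q^T G.
        # The normaliser sum_j (1 + scale * q.k_j) equals n because centred keys sum to zero.
        out.append([(v_sum[c] + scale * sum(q[a] * G[a][c] for a in range(d))) / n
                    for c in range(dv)])
    return out
```

Each query costs O(d·dv) once `G` is built, instead of touching all n keys. Note that 1 + q·k can go negative for large dot products, so the implied weights are not guaranteed to be non-negative in this simplified form.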

Efficient memristor accelerator for transformer self-attention functionality

Bettayeb, Halawani, Khan et al. 2024

Sci Rep

The adoption of transformer networks has experienced a notable surge in various AI applications. However, the increased computational complexity, stemming primarily from the self-attention mechanism, parallels the manner in which convolution operations constrain the capabilities and speed of convolutional neural networks (CNNs). The self-attention algorithm, specifically the matrix-matrix multiplication (MatMul) operations, demands a substantial amount of memory and computational complexity, thereby restricting the overall performance of the transformer. This paper introduces an efficient hardware accelerator for the transformer network, leveraging memristor-based in-memory computing. The design targets the memory bottleneck associated with MatMul operations in the self-attention process, utilizing approximate analog computation and the highly parallel computations facilitated by the memristor crossbar architecture. Remarkably, this approach resulted in a reduction of approximately 10 times in the number of multiply-accumulate (MAC) operations in transformer networks, while maintaining 95.47% accuracy for the MNIST dataset, as validated by a comprehensive circuit simulator employing NeuroSim 3.0. Simulation outcomes indicate an area utilization of 6895.7 , a latency of 15.52 seconds, an energy consumption of 3 mJ , and a leakage power of 59.55 . The methodology outlined in this paper represents a substantial stride towards a hardware-friendly transformer architecture for edge devices, poised to achieve real-time performance.
