MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture 2021
DOI: 10.1145/3466752.3480125
Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture

Cited by 80 publications (62 citation statements)
References 53 publications
“…Recent works look into co-designing for sparse architectures. Sanger prunes the attention matrix so that its reconfigurable architecture can exploit the resulting sparsity [36]. ESCALATE utilizes kernel decomposition to accelerate CNN models [33].…”
Section: Related Work (mentioning)
Confidence: 99%
“…Hardware-algorithm co-design for attention models. Several algorithmic optimizations co-designed with hardware acceleration were proposed for efficient execution of attention models [34,35,60,64,89,92,96,108]. A³ has proposed an approximation method with a hardware accelerator to prune out the ineffectual computations in attention.…”
Section: Related Work (mentioning)
Confidence: 99%
“…EdgeBERT [92] leverages an entropy-based early-exiting technique to predict the minimal number of transformer layers that need to be executed, while the rest can be skipped. Other works aim to address the computational cost of self-attention via sparse matrix operations [13,60,64], quantization [108], and Softmax approximation [89]. Moreover, none of these prior designs explored bit-level early compute termination.…”
Section: Related Work (mentioning)
Confidence: 99%
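To illustrate the early-exit idea in the statement above, here is a minimal, hypothetical Python sketch; the `layers`/`exit_classifiers` interfaces and the 0.2 entropy threshold are illustrative assumptions, not EdgeBERT's actual implementation.

```python
# Hypothetical sketch of entropy-based early exiting across transformer layers.
# `layers` and `exit_classifiers` are assumed to be lists of callables that
# return NumPy arrays; the threshold value is illustrative only.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def entropy(probs, eps=1e-12):
    # Shannon entropy of a probability vector: low entropy = confident prediction.
    return float(-np.sum(probs * np.log(probs + eps)))

def early_exit_forward(hidden, layers, exit_classifiers, threshold=0.2):
    """Run transformer layers one at a time; return as soon as an intermediate
    classifier is confident enough, skipping the remaining layers."""
    probs = None
    for layer, clf in zip(layers, exit_classifiers):
        hidden = layer(hidden)          # run one transformer layer
        probs = softmax(clf(hidden))    # intermediate prediction at this depth
        if entropy(probs) < threshold:  # confident -> exit early
            return probs
    return probs                        # ran all layers without an early exit
```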
“…CSP avoids sparsity-skipping logic and instead incorporates an early-stop mechanism based on the induced sparsity pattern. Sanger [24] is another 2-way sparse approach that targets the dynamic structures of attention-based models (i.e., the Logit and Attend operators); it dynamically applies fine-grained structured pruning with a dataflow that is well suited for the Logit and Attend operators. CSP-A is not a dynamic pruning method and instead targets the static elements of the attention layers, thus treating the Logit and Attend operators as dense.…”
Section: Related Work (mentioning)
Confidence: 99%
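To make the threshold-style dynamic attention pruning described above concrete, here is a minimal NumPy sketch; the post-softmax thresholding, the 0.02 cutoff, and the renormalization step are illustrative assumptions rather than Sanger's exact prediction-and-pack pipeline.

```python
# Minimal sketch of dynamic, threshold-based pruning of attention weights,
# in the spirit of the sparse-attention approaches cited above. The threshold
# and masking granularity are illustrative, not Sanger's actual algorithm.
import numpy as np

def pruned_attention(Q, K, V, threshold=0.02):
    """Compute attention, zeroing weights below `threshold` so that the
    Attend step (weights @ V) can skip the masked-out entries."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # Logit operator
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax
    mask = weights >= threshold                            # dynamic sparsity pattern
    sparse_w = np.where(mask, weights, 0.0)
    sparse_w /= sparse_w.sum(axis=-1, keepdims=True) + 1e-12  # renormalize rows
    return sparse_w @ V, mask                              # Attend operator + mask

# Example: the mask density shows how much of the Attend work could be skipped.
Q, K, V = (np.random.randn(8, 64) for _ in range(3))
out, mask = pruned_attention(Q, K, V)
print("kept fraction of attention entries:", mask.mean())
```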