2022
DOI: 10.48550/arxiv.2205.14135
Preprint

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, that is, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention …

Cited by 35 publications (37 citation statements)
References 33 publications
“…These optimizations create trade-offs between memory consumption and speed that can be tuned differently for training and inference. They include advanced implementations of neural network attention mechanisms (Vaswani et al 2017) with favorable properties for unusually short and long sequences (Rabe and Staats 2021, Dao et al 2022), module refactoring for lower memory usage, optional approximations of certain computations that reduce the memory burden, and specialized low-level code customized for GPU hardware. For technical details see appendices F.1 and F.2.…”
Section: Results (mentioning)
confidence: 99%
“…FlashAttention: We incorporate FlashAttention (Dao et al 2022), an efficient fused attention implementation that tiles computation in order to reduce data movement between different levels of GPU memory, greatly improving peak memory usage and runtime in the process. We find it to be particularly effective for short sequences with 1,000 residues or less, on which it contributes to an OpenFold speedup of up to 15% despite only being compatible with a small number of the attention modules in the network.…”
Section: Appendix A Related Work (mentioning)
confidence: 99%
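The tiling idea described in the statement above can be illustrated with a minimal pure-Python sketch. It is not the authors' fused GPU kernel: the function names `naive_attention` and `tiled_attention` and the `block` parameter are hypothetical, and the code only shows how keys and values can be visited block by block, with a running maximum and running softmax denominator, while still producing exact attention output.

```python
import math


def naive_attention(q, k, v):
    """Reference version: build each full score row, softmax it, weight v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=2):
    """Same result, but keys/values are processed one block at a time.

    For each query row only a running max, a running softmax denominator,
    and an unnormalized output accumulator are kept, so the full score row
    is never stored. (Illustrative only; `block` is a hypothetical knob.)
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        run_max = -math.inf
        denom = 0.0
        acc = [0.0] * dv
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj))
                      for kj in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial sums to the new reference maximum.
            correction = math.exp(run_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            denom = denom * correction + sum(weights)
            acc = [a * correction
                   + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            run_max = new_max
        out.append([a / denom for a in acc])
    return out
```

Calling both functions on the same small `q`, `k`, `v` lists should give matching rows up to floating-point rounding, whatever block size is chosen; the speed and memory benefits reported in the statement come from doing this inside a fused GPU kernel, which this sketch does not attempt to model.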
“…This greatly exceeds the input length of common transformers used in NLM. Efficient self-attention techniques can be used (Katharopoulos et al, 2020; Wang et al, 2020; Dao et al, 2022). Also, since the order of the genes is not sequential in scRNA-seq data, and the transformer computation is agnostic to the order, we can dynamically sample subsets of the input.…”
Section: Encoder and Gene Expression Modeling (mentioning)
confidence: 99%
“…Among which, PyTorch provides a standard implementation of MHA [28]; NVIDIA TensorRT provides fused MHA for short sequences whose lengths are smaller than 512 [29]. To scale the fused MHA to long sequences, Stanford researchers propose FlashAttention [30], which assumes identical shapes of inputs and assigns the workload of a whole attention unit to a single CTA. However, FlashAttention brings significant wasted computations if input sequence lengths are variable.…”
Section: B Related Work on DL Acceleration (mentioning)
confidence: 99%