Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, i.e., accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on Long-Range Arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher-quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
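The tiling idea can be made concrete with a short sketch. The NumPy code below is illustrative only: the block size, variable names, and pure-NumPy setting are assumptions, not the paper's CUDA kernel. It computes exact attention block by block with a running row-wise max and normalizer, so the full N×N score matrix is never materialized at once; in the real kernel this is what allows each block to stay in on-chip SRAM instead of being written back to HBM.

```python
# Minimal sketch of tiled (online-softmax) exact attention, assuming
# single-head inputs Q, K, V of shape (N, d). Block size is arbitrary.
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    N, d = Q.shape
    O = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max per query row
    row_sum = np.zeros(N)           # running softmax normalizer per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]        # one key block
        Vb = V[start:start + block_size]        # one value block
        S = Q @ Kb.T / np.sqrt(d)               # scores against this block

        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)       # rescale previous accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vb
        row_max = new_max

    return O / row_sum[:, None]

# Check against the straightforward quadratic reference.
rng = np.random.default_rng(0)
Q = rng.standard_normal((256, 64))
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```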
Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling on standard datasets (WikiText-103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100× faster at sequence length 64K.
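The two named ingredients can be sketched in a few lines. The code below is an assumption-laden illustration, not Hyena's implementation: the fixed exponentially decaying filter and the scalar projections stand in for the implicitly parametrized filters (a small network mapping positions to filter values) and learned projections used in the actual operator. It shows a causal long convolution evaluated in O(N log N) via the FFT, combined with data-controlled elementwise gating.

```python
# Minimal sketch of an FFT-based long convolution plus elementwise gating.
# Filter and projections are illustrative placeholders.
import numpy as np

def fft_causal_conv(u, h):
    """Causal (linear) convolution of signal u (N,) with filter h (N,) via FFT."""
    N = len(u)
    fft_size = 2 * N                          # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(u, fft_size) * np.fft.rfft(h, fft_size), fft_size)
    return y[:N]

def hyena_like_block(x, v_proj, gate_proj, h):
    """One gated long-convolution step: gate(x) * conv(v(x), h)."""
    v = x * v_proj                            # cheap stand-in for a learned projection
    gate = x * gate_proj                      # data-controlled gate
    return gate * fft_causal_conv(v, h)

N = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal(N)
h = np.exp(-0.01 * np.arange(N))              # long, exponentially decaying filter
y = hyena_like_block(x, v_proj=0.5, gate_proj=1.0, h=h)
print(y.shape)                                # (1024,)
```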
Experimental Data: Orexinergic neurotransmission is involved in mediating temperature responses to methamphetamine (Meth). In experiments in rats, SB-334867 (SB), an antagonist of orexin receptors (OX1R), at a dose of 10 mg/kg decreases late temperature responses (t > 60 min) to an intermediate dose of Meth (5 mg/kg). A higher dose of SB (30 mg/kg) attenuates temperature responses to a low dose (1 mg/kg) of Meth and to stress. In contrast, SB significantly exaggerates early responses (t < 60 min) to intermediate and high doses (5 and 10 mg/kg) of Meth. As pretreatment with SB also inhibits the temperature response to the stress of injection, traditional statistical analysis of temperature responses is difficult.
Mathematical Modeling: We have developed a mathematical model that explains the complexity of temperature responses to Meth as the interplay between excitatory and inhibitory nodes. We have extended the model to include the stress of manipulations and the effects of SB. Stress is synergistic with Meth in its action on the excitatory node. Orexin receptors mediate the activation of both excitatory and inhibitory nodes by low doses of Meth, but not of the node activated by high doses (HD). Exaggeration of early responses to high doses of Meth involves disinhibition: the low dose of SB decreases tonic inhibition of HD and lowers its activation threshold, while the higher dose suppresses the inhibitory component. A modeling approach to data assimilation appears efficient in separating the individual components of a complex response, enabling a statistical analysis unachievable by traditional data-processing methods.
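As an illustration of the model structure described above (not the authors' fitted model; all parameter names, values, and specific equations here are assumptions), a minimal two-node sketch couples an excitatory node and an inhibitory node, both driven by the drug input, with stress acting synergistically on the excitatory drive and temperature following the balance of the two nodes.

```python
# Hypothetical two-node excitatory/inhibitory sketch of a temperature response;
# parameters and equations are illustrative, not fitted to any data.
import numpy as np
from scipy.integrate import solve_ivp

def model(t, y, dose, stress, k):
    E, I, T = y                                    # excitatory, inhibitory, temperature
    drive = dose + k["stress_synergy"] * stress * dose   # stress synergistic with drug
    dE = -k["tau_E"] * E + drive                   # drug (plus stress) excites E
    dI = -k["tau_I"] * I + k["inhib_gain"] * dose  # drug also recruits inhibition
    dT = k["heat_gain"] * E - k["cool_gain"] * I - k["tau_T"] * (T - 37.0)
    return [dE, dI, dT]

k = dict(tau_E=0.05, tau_I=0.02, inhib_gain=0.3, heat_gain=0.004,
         cool_gain=0.003, tau_T=0.02, stress_synergy=0.5)
sol = solve_ivp(model, (0, 180), [0.0, 0.0, 37.0], args=(1.0, 1.0, k),
                t_eval=np.linspace(0, 180, 181))
print(sol.y[2, -1])                                # temperature (°C) at t = 180 min
```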