2023
DOI: 10.1007/978-3-031-25082-8_3

Hydra Attention: Efficient Attention with Many Heads

Cited by 41 publications (28 citation statements)
References 18 publications
“…Similar to our approach, various recent methods have achieved computation efficiency by skipping computation on a subset of input tokens. However, the selection mechanism can be very different, such as using pooling (Nawrot et al, 2022), token merging (Bolya et al, 2023), learned sigmoid gates (Bapna et al, 2020) and early exiting (Schuster et al, 2022). CODA introduces a differentiable router to enhance trainability and model performance, and tackles the problem of large model adaptation.…”
Section: Conditional Computation
confidence: 99%
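
A minimal, self-contained sketch of the token-skipping idea described above follows. It uses a learned sigmoid gate to decide which tokens receive the heavy computation; the module, threshold, and shapes are illustrative assumptions for the general technique, not code from CODA or any of the cited methods.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Runs an expensive sub-layer only on tokens selected by a learned gate (illustrative)."""

    def __init__(self, dim, hidden=2048, threshold=0.5):
        super().__init__()
        self.gate = nn.Linear(dim, 1)                      # scores each token
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))   # the "heavy" computation
        self.threshold = threshold

    def forward(self, x):                                  # x: (batch, tokens, dim)
        g = torch.sigmoid(self.gate(x))                    # soft gate in [0, 1]
        keep = (g > self.threshold).squeeze(-1)            # hard token selection
        out = x.clone()
        if keep.any():
            # only the selected tokens pass through the expensive sub-layer
            out[keep] = x[keep] + g[keep] * self.ffn(x[keep])
        return out

block = GatedBlock(dim=768)
y = block(torch.randn(2, 197, 768))
print(y.shape)                                             # torch.Size([2, 197, 768])
```

Skipping the feed-forward pass for unselected tokens is where the savings come from; multiplying by the soft gate value keeps a gradient path to the gate for the selected tokens, while fully differentiable routing as in CODA requires additional machinery not shown here.
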
“…One way to obtain a lightweight vision Transformer is to simplify the layers of the Transformer [40,25,8], but the benefit is limited since the major complexity arises from self-attention, not the layer stack. So, other efforts focus on altering the internal operations of the Transformer to make self-attention more efficient [36,4,2]. In Hydra Attention [2], the computing order inside self-attention is reorganized and the combination of multiple heads is incorporated into self-attention to reduce the complexity. Nevertheless, it is workable only when no nonlinear component such as SoftMax is applied in self-attention, which limits its applications.…”
Section: Related Work
confidence: 99%
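
The reordering described in this citing text can be shown in a few lines: once SoftMax is removed and one head is used per feature, the quadratic (QK^T)V product regroups as Q * sum_t(K_t * V_t), which is linear in the number of tokens. The sketch below assumes a cosine-similarity (L2-normalized) kernel; the function and variable names are illustrative rather than taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """q, k, v: (batch, tokens, dim) tensors; returns (batch, tokens, dim).

    With no softmax and as many heads as features, the (Q K^T) V product
    regroups as Q * sum_t(K_t * V_t), avoiding the tokens-by-tokens matrix.
    """
    q = F.normalize(q, dim=-1)                 # cosine-similarity kernel in place of softmax
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=1, keepdim=True)      # (batch, 1, dim): a single global summary
    return q * kv                              # broadcast the summary back to every token

x = torch.randn(2, 197, 768)                   # e.g. ViT-B/16: 197 tokens, 768 features
out = hydra_attention(x, x, x)
print(out.shape)                               # torch.Size([2, 197, 768])
```

The linear cost comes directly from this regrouping, which is also why the citing authors note that the trick breaks down once a nonlinear component such as SoftMax is reinserted between Q and K.
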