2021
DOI: 10.1162/tacl_a_00436

Differentiable Subset Pruning of Transformer Heads

Abstract: Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term…
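The abstract describes removing individual heads from a Transformer's multi-head attention. As a point of reference only, below is a minimal PyTorch-style sketch of head pruning by masking head outputs; it illustrates the generic idea of pruning heads, not the paper's differentiable subset pruning method, and all class and method names are hypothetical.

```python
# Minimal sketch of pruning attention heads by masking their outputs.
# This shows the generic idea of head pruning only; it is NOT the
# differentiable subset pruning method proposed in the paper.
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One gate per head: 1.0 keeps the head, 0.0 prunes it.
        self.register_buffer("head_gate", torch.ones(n_heads))

    def prune_heads(self, heads_to_prune):
        for h in heads_to_prune:
            self.head_gate[h] = 0.0

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (batch, n_heads, seq_len, d_head).
        def split(t):
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        ctx = attn @ v                               # (batch, n_heads, seq_len, d_head)
        # Zero out the contribution of pruned heads.
        ctx = ctx * self.head_gate.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(ctx)
```

For example, `mha = MaskedMultiHeadAttention(d_model=512, n_heads=8)` followed by `mha.prune_heads([1, 5])` leaves six active heads; because the gating preserves tensor shapes, the rest of the network is unchanged.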

Cited by 18 publications (9 citation statements)
References 23 publications
“…Only removing heads does not lead to large latency improvement; Li et al. (2021) demonstrate a 1.4× speedup with only one remaining head per layer.…”
Section: Pruning (mentioning)
confidence: 98%
“…Pruning methods: In this work we replaced the attention matrix with a constant one in order to measure the importance of the input-dependent ability. Works like Michel et al. (2019) and Li et al. (2021) pruned attention heads in order to measure their importance for the task examined. These works find that for some tasks, only a small number of unpruned attention heads is sufficient, and thus relate to the question of how much attention a PLM uses.…”
Section: Related Work (mentioning)
confidence: 99%
“…The top-k attention heads and hidden dimensions with the highest importance scores are kept. The implementations for STEP are borrowed from [35]. Aside from the baseline methods, we also compare our method with the previous pruning method VPT [67].…”
Section: Methods (mentioning)
confidence: 99%
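The statement above mentions keeping the top-k attention heads with the highest importance scores. Below is a minimal sketch of that selection step, assuming the per-head importance scores are already computed (e.g., by a gradient-based sensitivity measure); the function name and tensor shapes are illustrative, not taken from the cited works.

```python
# Hypothetical sketch: keep only the top-k heads per layer, ranked by a
# precomputed importance score. Computing the scores themselves is assumed
# to happen elsewhere (e.g., gradient-based head sensitivity).
import torch

def select_top_k_heads(importance: torch.Tensor, k: int) -> torch.Tensor:
    """importance: (n_layers, n_heads) head-importance scores.
    Returns a boolean mask of the same shape, True for heads that are kept."""
    n_layers, n_heads = importance.shape
    topk = importance.topk(k, dim=-1).indices      # (n_layers, k) kept-head indices
    mask = torch.zeros(n_layers, n_heads, dtype=torch.bool)
    rows = torch.arange(n_layers).unsqueeze(-1)    # broadcasts against topk columns
    mask[rows, topk] = True
    return mask

# Example: a 12-layer, 12-head model, keeping the 2 highest-scoring heads per layer.
scores = torch.rand(12, 12)
keep_mask = select_top_k_heads(scores, k=2)
print(keep_mask.sum(dim=-1))  # each layer keeps exactly 2 heads
```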