Interspeech 2021
DOI: 10.21437/interspeech.2021-280

Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models

Abstract: Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some architectures with reduced numbers of attention heads have better performance. Since the search for the best structure is …
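The abstract describes training a Transformer while attention heads are removed at random, so that a single model covers many reduced-head configurations. Below is a minimal PyTorch-style sketch of that idea, assuming a per-head keep probability applied only during training; the module name, the `p_keep` value, and the choice not to rescale surviving heads are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn


class StochasticHeadRemovalMHA(nn.Module):
    """Self-attention whose heads are randomly removed during training
    (a sketch of the stochastic attention head removal idea)."""

    def __init__(self, d_model: int, n_heads: int, p_keep: float = 0.8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.p_keep = p_keep  # assumed per-head keep probability (illustrative)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, time, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # per-head context vectors

        if self.training:
            # drop entire heads at random; each head survives with prob p_keep
            keep = (torch.rand(b, self.n_heads, 1, 1, device=x.device)
                    < self.p_keep).to(heads.dtype)
            heads = heads * keep

        return self.out(heads.transpose(1, 2).reshape(b, t, -1))
```

In this sketch the random removal is active only in training mode; at evaluation time all heads are used and the layer behaves as a standard multi-head attention module.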

Cited by 7 publications (2 citation statements)
References 26 publications

“…We also compared the proposed models with other alternative systems, including the end-to-end (E2E) stochastic attention head removal (SAHR) [70], multi-stream E2E [71], the hybrid multi-scale octave CNNs [72], and parametric (Parznet) 2-D CNNs without [73] and with variational inference (VI) [12].…”
Section: B. Results and Discussion
confidence: 99%
“…The scaling factor s_g compensates for the masked portion and maintains the statistics after gating is applied. Recently, attention head dropout [17,18] has been introduced with similar scaling, but their scaling is used for regularization during training. We found that this technique greatly stabilizes the training dynamics, especially after the training is stabilized by the above two methods.…”
Section: Techniques for Head Pruning
confidence: 99%
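The quoted passage describes gating attention heads together with a scaling factor s_g that compensates for the masked portion so the post-gating statistics match the ungated layer, much like inverted-dropout scaling. The sketch below shows one such compensation under the assumption that s_g takes the inverse-keep-rate form n_heads / n_kept; the cited works may define s_g differently.

```python
import torch


def gate_and_rescale_heads(heads: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Zero out gated-off heads and rescale the survivors.

    heads: (batch, n_heads, time, d_head) per-head attention outputs
    gate:  (batch, n_heads) float mask, 1.0 = keep head, 0.0 = remove head
    """
    n_heads = heads.size(1)
    n_kept = gate.sum(dim=1, keepdim=True).clamp(min=1.0)  # heads kept per example
    # Assumed compensation: scale surviving heads by n_heads / n_kept so the
    # summed head output keeps its expected magnitude (inverted-dropout style).
    s_g = n_heads / n_kept                                  # (batch, 1)
    return heads * gate[:, :, None, None] * s_g[:, :, None, None]
```

For example, with 8 heads and a gate that keeps 6 of them, the surviving head outputs are scaled by 8/6 ≈ 1.33, so their summed contribution stays roughly at the magnitude of the fully-headed layer.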