2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383521
On The Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

Abstract: Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note that the range of the learned context increases from the lower to the upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence useful for the upper self-attention encoder layers?
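To make the observation about growing context concrete, the sketch below (not code from the paper) computes a simple per-layer "attention context range": the attention-weighted mean distance between query and key frames. The metric, and the `attn_per_layer` layout in the usage comment, are illustrative assumptions.

```python
import numpy as np

def attention_context_range(attn):
    """Attention-weighted mean query-key distance for one attention matrix.

    attn: (T, T) array of attention weights; each row (query) sums to 1.
    Returns the average number of frames a query attends away from itself,
    a rough proxy for how 'global' the learned context is.
    """
    T = attn.shape[0]
    positions = np.arange(T)
    distances = np.abs(positions[None, :] - positions[:, None])  # (T, T)
    return float((attn * distances).sum(axis=1).mean())

# Hypothetical usage: attn_per_layer[l][h] holds the (T, T) attention matrix
# of head h in encoder layer l, averaged over a batch of utterances.
# for l, heads in enumerate(attn_per_layer):
#     ranges = [attention_context_range(a) for a in heads]
#     print(f"layer {l}: mean context range = {np.mean(ranges):.1f} frames")
```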

Cited by 25 publications (15 citation statements)
References 23 publications
“…We introduce the stochastic attention head removal (SAHR) strategy in this section. In our previous work [11], we found that in trained Transformer ASR models there are heads which only generate nearly diagonal attention matrices. The high diagonality indicates that these heads are effectively identity mappings and are thus redundant.…”
Section: Stochastic Attention Head Removal
confidence: 91%
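One possible way to quantify how diagonal a head's attention matrix is, and hence flag near-identity heads, is sketched below. The band-based metric and the 0.9 threshold in the usage comment are assumptions for illustration, not necessarily the measure used in [11].

```python
import numpy as np

def diagonality(attn, band=1):
    """Fraction of attention mass within `band` frames of the diagonal.

    attn: (T, T) attention matrix with rows summing to 1.
    A value close to 1.0 means the head mostly maps each frame to itself
    (a near-identity mapping), suggesting the head is redundant.
    """
    T = attn.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[None, :] - idx[:, None]) <= band
    return float(attn[mask].sum() / T)

# Hypothetical usage: flag heads whose averaged attention is nearly diagonal.
# redundant = [(l, h) for l, heads in enumerate(attn_per_layer)
#              for h, a in enumerate(heads) if diagonality(a) > 0.9]
```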
“…Zhou et al. [21] proposed a similar method of training self-attention models for natural language processing (NLP). However, that work does not analyse where the benefits of such methods come from; in contrast, we derive our method independently, building on our previous work studying the usefulness of the attention heads [11]. Additionally, in this work, our detailed analysis of the proposed method reveals why dynamically removing attention heads is helpful.…”
Section: Related Work
confidence: 99%
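As a rough illustration of dynamically removing attention heads during training, the sketch below samples a per-head keep mask for one forward pass. The removal probability, the all-or-nothing masking, and where the mask is applied are assumptions, not the exact SAHR recipe from the cited work.

```python
import numpy as np

def stochastic_head_mask(n_heads, p_remove=0.2, training=True, rng=None):
    """Sample a per-head keep mask for one forward pass.

    During training each head is independently removed with probability
    p_remove (its output is zeroed); at inference all heads are kept.
    """
    rng = rng or np.random.default_rng()
    if not training:
        return np.ones(n_heads)
    mask = (rng.random(n_heads) >= p_remove).astype(float)
    if mask.sum() == 0:          # keep at least one head active
        mask[rng.integers(n_heads)] = 1.0
    return mask

# Hypothetical usage inside a multi-head attention block:
# head_outputs: (n_heads, T, d_head) -> zero out removed heads, then concat.
# mask = stochastic_head_mask(n_heads=8, training=True)
# head_outputs = head_outputs * mask[:, None, None]
```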