On The Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

Zhang, Shucong; Loweimi, Erfan; Bell, P. J.; Renals, Steve

doi:10.1109/slt48900.2021.9383521

Cited by 25 publications

(15 citation statements)

References 23 publications

“…We introduce the stochastic attention head removal (SAHR) strategy in this section. In our previous work [11], we found in trained Transformer ASR models, there are heads which only generates nearly diagonal attention matrices. The high diagonality indicates these heads are merely identity mappings and thus they are redundant.…”

Section: Stochastic Attention Head Removal (mentioning)

confidence: 91%

“…Zhou et al. [21] proposed a similar method of training self-attention models for natural language processing (NLP). However, the related work does not analyse where the benefits of such methods come from, whilst in contrast, we independently derive our method based on our previous work which studies the usefulness of the attention heads [11]. Additionally, in this work, our detailed analysis of the proposed method reveals why dynamically removing attention heads is helpful.…”

Section: Related Work (mentioning)

confidence: 99%

“…Since self-attention encodes sequences through attention mechanisms, it can model dependencies within any range. However, previous works have shown it is less effective in capturing local information [11,22]. Nevertheless, acoustic events usually happen within short periods and thus local information is essential for ASR tasks.…”

Section: Transformer Based Models (mentioning)

confidence: 99%

“…We use WSJ as the dataset and the experimental setups are presented in Section 5. For the baseline model, following our previous work, we plot the heatmap of the averaged diagonality [11] on WSJ eval92 for each head in every encoder layer. Then, we remove the heads whose averaged diagonality are above a threshold and retrain the model.…”

Section: Stochastic Attention Head Removal (mentioning)

confidence: 99%
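
The head-selection step described in the statement above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `heads_to_remove`, the nested-list layout of the scores, and the example values and threshold are all assumptions. The diagonality measure itself is defined in [11] and is not reproduced here.

```python
def heads_to_remove(avg_diagonality, threshold):
    """Return (layer, head) index pairs whose averaged diagonality exceeds the threshold.

    avg_diagonality[layer][head] is assumed to hold a precomputed score;
    how that score is computed is outside the scope of this sketch.
    """
    return [
        (layer, head)
        for layer, row in enumerate(avg_diagonality)
        for head, score in enumerate(row)
        if score > threshold
    ]


# Hypothetical scores for a 2-layer encoder with 2 heads per layer.
scores = [[0.2, 0.9], [0.95, 0.1]]
selected = heads_to_remove(scores, threshold=0.8)
```

With these made-up scores, the second head of the first layer and the first head of the second layer would be marked for removal before retraining.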

“…In self-attention based models such as Transformers [1], each self-attention layer uses multi-head attention to capture a set of different inputs representations, and have given competitive ASR results [2][3][4][5][6][7][8][9][10]. However, we previously observed that not all of the attention heads are useful [11]. We found that in trained Transformer ASR models, the attention matrices of some attention heads are close to the identity matrices, indicating that the attention mechanism is just an identity mapping.…”

Section: Introduction (mentioning)

confidence: 99%

Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models

Zhang, Loweimi, Bell et al. 2021

Interspeech 2021

Self Cite

Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some architectures with reduced numbers of attention heads have better performance. Since the search for the best structure is prohibitively time-consuming, we propose to randomly remove attention heads during training and keep all attention heads at test time, so that the final model is an ensemble of models with different architectures. The proposed method also forces each head to independently learn the most useful patterns. We apply the proposed method to train Transformer based and Convolution-augmented Transformer (Conformer) based ASR models. Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets. To the best of our knowledge, we have achieved state-of-the-art end-to-end Transformer based model performance on Switchboard and AMI.
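
The train-time versus test-time behaviour described in this abstract can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the paper's implementation: the names `sample_head_mask` and `apply_head_mask`, the single per-head removal probability `remove_prob`, and the choice to zero a removed head's output without any rescaling are all hypothetical.

```python
import random


def sample_head_mask(num_heads, remove_prob, training, rng=None):
    """Return a list of 0/1 flags, one per attention head (1 = keep, 0 = remove).

    During training each head is removed independently with probability
    remove_prob (an assumed hyperparameter); at test time every head is kept.
    """
    if not training:
        return [1] * num_heads
    rng = rng or random
    return [0 if rng.random() < remove_prob else 1 for _ in range(num_heads)]


def apply_head_mask(head_outputs, mask):
    """Zero the output vector of each removed head; kept heads pass through unchanged."""
    return [[value * keep for value in vec] for vec, keep in zip(head_outputs, mask)]


# Toy example: 4 heads, each producing a 2-dimensional output vector.
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
train_mask = sample_head_mask(4, remove_prob=0.25, training=True, rng=random.Random(0))
test_mask = sample_head_mask(4, remove_prob=0.25, training=False)
masked = apply_head_mask(outputs, train_mask)
```

Because a different subset of heads is active at each training step, the full model used at test time can be read as an ensemble over those sub-architectures, which is the interpretation the abstract gives.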

RAttSR: A Novel Low-Cost Reconstructed Attention-Based End-to-End Speech Recognizer

Paul, Phadikar 2023

Circuits Syst Signal Process

Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique

Vitale, Cutugno, Origlia et al. 2024

Neural Comput & Applic

Automatic speech recognition systems based on end-to-end models (E2E-ASRs) can achieve comparable performance to conventional ASR systems while reproducing all their essential parts automatically, from speech units to the language model. However, they hide the underlying perceptual processes modelled, if any, they adapt less readily to multiple application contexts, and they require powerful hardware and an extensive amount of training data. Model-explainability techniques can explore the internal dynamics of these ASR systems and possibly explain the processes leading to their decisions and outputs. Understanding these processes can help enhance ASR performance and significantly reduce the required training data and hardware. In this paper, we probe the internal dynamics of three E2E-ASRs pre-trained for English by building an acoustic-syllable boundary detector for Italian and Spanish based on the E2E-ASRs’ internal encoding layer outputs. We demonstrate that the shallower E2E-ASR layers spontaneously form a rhythmic component correlated with prominent syllables, which are central in human speech processing. This finding highlights a parallel between the analysed E2E-ASRs and human speech recognition. Our results contribute to the body of knowledge by providing a human-explainable insight into behaviours encoded in popular E2E-ASR systems.
