Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models

Zhang, Shucong; Loweimi, Erfan; Bell, P. J.; Renals, Steve

doi:10.21437/interspeech.2021-280

Cited by 7 publications

(2 citation statements)

References 26 publications

“…We also compared the proposed models with other alternative systems including the end-to-end (E2E) stochastic attention head removal (SAHR) [70], multi-stream E2E [71] and the hybrid multi-scale octave CNNs [72], parametric (Parznet) 2-D CNNs without [73] and with variational inference (VI) [12].…”

Section: B Results and Discussion (mentioning)

confidence: 99%

Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform

Loweimi, Yue, Bell, et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

“…The scaling factor s_g compensates for the masked portion and maintains the statistics after gating is applied. Recently, attention head dropout [17,18] has been introduced with similar scaling, but their scaling is used for regularization during training. We found that this technique greatly stabilizes the training dynamics, especially after the training is stabilized by the above two methods.…”

Section: Techniques For Head Pruning (mentioning)

confidence: 99%

Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

Shim, Choi, Sung, et al. 2021

Preprint


While Transformer-based models have shown impressive language modeling performance, the large computation cost is often prohibitive for practical use. Attention head pruning, which removes unnecessary attention heads in the multihead attention, is a promising technique to solve this problem. However, it does not evenly reduce the overall load because the heavy feedforward module is not affected by head pruning. In this paper, we apply layer-wise attention head pruning on the All-attention [1] Transformer so that the entire computation and the number of parameters can be reduced proportionally to the number of pruned heads. While the architecture has the potential to fully utilize head pruning, we propose three training methods that are especially helpful to minimize performance degradation and stabilize the pruning process. Our pruned model shows consistently lower perplexity within a comparable parameter size than Transformer-XL on the WikiText-103 language modeling benchmark.
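
The citation statement above mentions attention head dropout with a compensating scale factor. A minimal sketch of that general idea follows, assuming NumPy and the usual inverted-dropout convention (scale kept heads by 1 / (1 - p)). The function name `drop_heads`, its parameters, and the tensor layout are hypothetical and chosen for illustration; this is not the implementation from either paper.

```python
import numpy as np

def drop_heads(head_outputs, p_remove, rng, training=True):
    """Randomly zero whole attention heads and rescale the survivors.

    head_outputs: array of shape (num_heads, seq_len, d_head) (assumed layout).
    p_remove: probability of removing each head, in [0, 1).
    rng: a numpy.random.Generator.
    """
    if not 0.0 <= p_remove < 1.0:
        raise ValueError("p_remove must be in [0, 1)")
    # At inference time all heads are kept and no scaling is applied.
    if not training or p_remove == 0.0:
        return head_outputs
    num_heads = head_outputs.shape[0]
    # One Bernoulli draw per head: True means the head is kept.
    keep = rng.random(num_heads) >= p_remove
    # Assumption: inverted-dropout scaling, so the expected output
    # matches the unmasked output.
    scale = 1.0 / (1.0 - p_remove)
    return head_outputs * keep[:, None, None] * scale

rng = np.random.default_rng(0)
x = np.ones((8, 4, 16))
y = drop_heads(x, p_remove=0.5, rng=rng)
```

With `p_remove=0.5`, each head in `y` is either all zeros or scaled by 2, so the sum over heads keeps the same expectation as the unmasked input.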


Attention to Phonetics: A Visually Informed Explanation of Speech Transformers

Shams, Carson-Berndsen 2024

Lecture Notes in Computer Science

