Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer 2021
DOI: 10.18653/v1/2021.acl-long.152

Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

Abstract: This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task and pruning the remaining heads leads to comparable or improved performance of the model. However, the impact of pruning attention heads is not yet clear in cross-lingual and multi-lingual tasks. Through extensive experime…

Cited by 6 publications (3 citation statements)
References 24 publications
“…To identify and remove interference, we need a metric that can separate harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head.…”
Section: Methods (mentioning)
confidence: 99%
“…To identify and remove interference, we need a metric which can distinguish harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head regardless of the direction of that sensitivity.…”
Section: Methods (mentioning)
confidence: 99%
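
The point made in the statement above, that a gradient-magnitude score ignores the direction of the sensitivity, can be illustrated with a small sketch. Everything in it is an assumption for illustration and not code from the cited papers: a toy model whose prediction is the masked sum of per-head outputs, a squared-error loss, and the hypothetical helper names `mask_gradients` and `head_importance`. Under a first-order (local) reading, a negative gradient with respect to a head's mask suggests the head is helping, and a positive one suggests it is hurting.

```python
# Toy illustration only (hypothetical setup, not the cited papers' code).
# Model: prediction = sum_h mask[h] * head_out[h]
# Loss:  0.5 * (prediction - target) ** 2
# So dLoss/dmask[h] = (prediction - target) * head_out[h], taken at mask = 1.

def mask_gradients(head_outputs, target):
    """Return dLoss/dmask[h] for one example, with every mask set to 1."""
    prediction = sum(head_outputs)
    residual = prediction - target
    return [residual * out for out in head_outputs]


def head_importance(examples):
    """Per head: mean |gradient| (magnitude score) and mean gradient (signed)."""
    n_heads = len(examples[0][0])
    magnitude = [0.0] * n_heads
    signed = [0.0] * n_heads
    for head_outputs, target in examples:
        grads = mask_gradients(head_outputs, target)
        for h, g in enumerate(grads):
            magnitude[h] += abs(g) / len(examples)
            signed[h] += g / len(examples)
    return magnitude, signed


# Head 0 pushes the prediction toward the target, head 1 pushes it away,
# and head 2 contributes nothing.
examples = [
    ([1.0, -1.0, 0.0], 1.0),
    ([2.0, -2.0, 0.0], 2.0),
]
magnitude, signed = head_importance(examples)
print("magnitude:", magnitude)
print("signed:   ", signed)
```

In this toy setup heads 0 and 1 receive the same magnitude score, although their signed gradients are opposite; only the signed value separates the helpful head from the harmful one, while head 2 scores zero on both.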
“…These works adapt a model for the target language by training only Adapters (Pfeiffer et al., 2020; Ansell et al., 2021), prompts (Zhao and Schütze, 2021), or subsets of model parameters (Ansell et al., 2022). Ma et al. (2021) previously investigated pruning in multilingual models using gradient-based importance metrics to study variability across attention heads. However, they used a process of iterative pruning and language-specific finetuning.…”
Section: Multilingual Learning (mentioning)
confidence: 99%