Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

Ma, Weicheng; Zhang, Kai; Lou, Renze; Wang, Lili; Vosoughi, Soroush

doi:10.18653/v1/2021.acl-long.152

Cited by 6 publications

(3 citation statements)

References 24 publications

“…To identify and remove interference, we need a metric that can separate harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head.…”

Section: Methods (mentioning)

confidence: 99%

Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers

Held,

Yang

2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Multilingual transformer-based models demonstrate remarkable zero and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades. Those who attribute this interference phenomenon to limited model capacity address the problem by adding additional parameters, despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific attention heads. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that pruning such heads from a fixed model improves performance for a target language on both sentence classification and structural prediction. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.
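
The Shapley-value idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the values are estimated, so the code below simply enumerates every ordering of the heads, which is only feasible for a handful of them. The head names (`h0`, `h1`, `h2`) and the `toy_accuracy` function are hypothetical stand-ins for evaluating a real model with a subset of heads kept.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when added to the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_accuracy(kept_heads):
    """Hypothetical score of a model that keeps only `kept_heads` (assumption)."""
    score = 0.5
    if "h0" in kept_heads:
        score += 0.2  # beneficial head
    if "h2" in kept_heads:
        score -= 0.1  # head that introduces interference
    return score      # h1 has no effect on the score


heads = ["h0", "h1", "h2"]
phi = shapley_values(heads, toy_accuracy)

# Heads with a negative Shapley value lower the score on average,
# so they are the candidates for pruning in this toy setting.
to_prune = [h for h in heads if phi[h] < 0]
```

In this toy game the value function is additive, so each head's Shapley value matches its individual effect; with a real model the contributions interact, which is where averaging over coalitions matters.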

“…To identify and remove interference, we need a metric which can distinguish harmful, unimportant, and beneficial attention heads. Prior work (Michel et al., 2019; Ma et al., 2021) utilized the magnitude of gradients as an importance metric. However, this metric measures the sensitivity of the loss function to the masking of a particular head regardless of the direction of that sensitivity.…”

Section: Methods (mentioning)

confidence: 99%
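
The point made in the statement above, that a gradient magnitude hides the direction of a head's effect, can be shown with a small sketch. Everything here is assumed for illustration: `toy_loss` is a hypothetical loss that depends linearly on three head-mask values, and the gradient is taken by central finite differences rather than by backpropagation through a real model.

```python
def toy_loss(mask):
    """Hypothetical loss as a function of three head-mask values (assumption)."""
    m0, m1, m2 = mask
    # Head 0 lowers the loss when active, head 1 is irrelevant,
    # head 2 raises the loss when active.
    return 1.0 - 0.4 * m0 + 0.0 * m1 + 0.3 * m2


def mask_gradient(loss, mask, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each mask entry."""
    grads = []
    for i in range(len(mask)):
        up = list(mask)
        down = list(mask)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up) - loss(down)) / (2 * eps))
    return grads


grads = mask_gradient(toy_loss, [1.0, 1.0, 1.0])

# Magnitude-based importance: heads 0 and 2 both look "important".
magnitude = [abs(g) for g in grads]

# The sign separates them: a negative gradient means the head reduces the
# loss (beneficial), a positive gradient means it increases it (harmful).
beneficial = [i for i, g in enumerate(grads) if g < -1e-4]
harmful = [i for i, g in enumerate(grads) if g > 1e-4]
```

Ranking by `magnitude` alone would treat the harmful head 2 as nearly as valuable as the beneficial head 0, which is the limitation the citing authors describe.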

“…These works adapt a model for the target language by training only Adapters (Pfeiffer et al., 2020; Ansell et al., 2021), prompts (Zhao and Schütze, 2021), or subsets of model parameters (Ansell et al., 2022). Ma et al. (2021) previously investigated pruning in multilingual models using gradient-based importance metrics to study variability across attention heads. However, they used a process of iterative pruning and language-specific finetuning.…”

Section: Multilingual Learning (mentioning)

confidence: 99%

Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers

Held,

Yang

2022

Preprint

Multilingual transformer-based models demonstrate remarkable zero and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades, a phenomenon termed interference. Often attributed to limited model capacity, interference is commonly addressed by adding additional parameters despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific parameters. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction, seeing gains as large as 24.7%. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Feng,

Chen,

et al. 2024

IEEE Trans. Multimedia
