Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.132

Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation

Abstract: We study the power of cross-attention in the Transformer architecture within the context of transfer learning for machine translation, and extend the findings of studies into cross-attention when training from scratch. We conduct a series of experiments through fine-tuning a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning all parameters (i.e., the entire translation model). […]
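To make the fine-tuning setup concrete, here is a minimal sketch of training only the cross-attention parameters of a pretrained encoder-decoder translation model. This is not the authors' code: it assumes a Hugging Face Marian-style model in which the decoder's cross-attention modules (and their layer norms) carry "encoder_attn" in their parameter names; other architectures use other names.

from transformers import MarianMTModel

# Load a pretrained translation model (hypothetical checkpoint choice).
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# Freeze everything except the decoder cross-attention ("encoder_attn") blocks.
for name, param in model.named_parameters():
    param.requires_grad = "encoder_attn" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"fine-tuning {trainable / total:.1%} of all parameters")

Any standard training loop then updates only those parameters; this is the regime whose BLEU the paper compares against fine-tuning the entire model.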

Cited by 40 publications (14 citation statements) | References 27 publications
“…Subsequently, it was widely applied to a variety of tasks, e.g., image-text classification (Lee et al., 2018) and machine translation (Gheini et al., 2021). These applications have demonstrated that the cross-attention mechanism can construct explicit interactions between two separate inputs and thus take full advantage of their correlation.…”
Section: Methods
confidence: 99%
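As an illustration of the "explicit interaction between two separate inputs" that this statement refers to, the following hedged sketch uses PyTorch's nn.MultiheadAttention as a cross-attention layer: queries come from one input, keys and values from the other, so the attention weights directly couple the two sequences. Shapes and names are illustrative only.

import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Two separate inputs, e.g. target-language states and source-language states,
# or text features and image-region features.
text = torch.randn(2, 20, d_model)
other = torch.randn(2, 36, d_model)

# Query = one input; Key = Value = the other input.
fused, attn_weights = cross_attn(query=text, key=other, value=other)
print(fused.shape, attn_weights.shape)  # torch.Size([2, 20, 512]) torch.Size([2, 20, 36])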
“…None of these existing approaches uses the gaze signal as input, and they report a loss in accuracy when they do. Some recent approaches also leverage attention transformers for multi-modal learning, for example: [Gheini et al 2021] use cross-attention to avoid full fine-tuning of language translation models; [Mohla et al 2020] use attention from Lidar and content from spectral imaging to combine them for image segmentation; and [Ye et al 2019] use attention transformers to segment out the object described by a text query in a given image. CMA, on the other hand, infers the spatio-temporal relationships across different modalities by combining information from all modalities via attention transformers [Vaswani et al 2017] and adaptively updates each modality's features to disseminate the global information from all modalities.…”
Section: Multi-modal Fusion
confidence: 99%
“…We decided to follow this protocol in order to isolate the effect of the ablated component on the final BLEU score and to prevent the other components from compensating for it. In concurrent work, Gheini et al. (2021) considered a similar experimental protocol, but to study a different though related phenomenon. In Figure 8, we show the ablation results for the en→de direction.…”
Section: B1 Supervised Translation Ablations
confidence: 99%
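The protocol this statement describes can be pictured with a short, heavily hedged sketch (not the cited authors' setup): reinitialize the component under study, freeze every other parameter so the rest of the model cannot compensate, then fine-tune and measure BLEU to isolate that component's contribution. The model checkpoint and the choice of ablated component below are hypothetical.

import torch.nn as nn
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
ablated = "encoder_attn"  # hypothetical choice: the decoder cross-attention

# Reset the ablated component's projection layers.
for name, module in model.named_modules():
    if ablated in name and isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Freeze everything else so it cannot compensate during fine-tuning.
for name, param in model.named_parameters():
    param.requires_grad = ablated in name

# ...fine-tune on the parallel data, then score with sacrebleu; the change in
# BLEU relative to the unablated baseline is attributed to this component.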