2019
DOI: 10.48550/arxiv.1911.09333
Preprint

Generating Diverse Translation by Manipulating Multi-Head Attention

Abstract: The Transformer model (Vaswani et al. 2017) has been widely used in machine translation tasks and has obtained state-of-the-art results. In this paper, we report an interesting phenomenon in its encoder-decoder multi-head attention: different attention heads of the final decoder layer align to different word translation candidates. We empirically verify this discovery and propose a method to generate diverse translations by manipulating heads. Furthermore, we make use of these diverse translations with the back-transla…
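The observation in the abstract can be made concrete with a small inspection script. The following is a minimal sketch, not the authors' code: it assumes access to the final decoder layer's encoder-decoder attention weights as a tensor of shape [num_heads, tgt_len, src_len] (here filled with random values purely for illustration) and reports which source position each head attends to most strongly at a given decoding step.

```python
# Minimal sketch (not the authors' code): inspect which source token each
# encoder-decoder attention head of the final decoder layer points to.
# Tensor shapes and variable names are illustrative assumptions.
import torch

num_heads, tgt_len, src_len = 8, 5, 7
# attn[h, t, s]: weight that head h puts on source position s at decoding step t.
# In practice this would come from the final decoder layer's cross-attention.
attn = torch.softmax(torch.randn(num_heads, tgt_len, src_len), dim=-1)

step = 2  # a single decoding step
for h in range(num_heads):
    aligned_src = attn[h, step].argmax().item()
    print(f"head {h} aligns step {step} to source position {aligned_src}")
# If the paper's observation holds, different heads report different source
# positions here, i.e. they point at different word translation candidates.
```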

Cited by 2 publications (3 citation statements) · References 22 publications
“…In our experiment, we set the number of hidden states to 5. • Head Sampling (Sun et al., 2019): it generates different translations by sampling different encoder-decoder attention heads according to their attention weights, and copying the samples to other heads under some conditions. Here, we set the parameter K = 3.…”
Section: Results in Diverse Translation
confidence: 99%
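The head-sampling procedure summarized in the citation above can be sketched as follows. This is my reading of that one-sentence description, not the original implementation: at a decoding step, one of the top-K heads is sampled in proportion to its attention weight and its distribution is copied to the other heads; the scoring by peak attention weight and the function name are assumptions for illustration.

```python
# Hedged sketch of the head-sampling idea (an interpretation of the citing
# paper's summary, not Sun et al.'s code): sample one of the top-K
# encoder-decoder attention heads, weighted by its attention mass, and copy
# its distribution to the other heads to steer the translation.
import torch

def sample_and_copy_head(attn_step: torch.Tensor, k: int = 3) -> torch.Tensor:
    """attn_step: [num_heads, src_len] cross-attention weights at one decoding step."""
    # Score each head by its peak attention weight (an illustrative choice).
    head_scores = attn_step.max(dim=-1).values        # [num_heads]
    topk_scores, topk_idx = head_scores.topk(k)       # restrict to top-K heads
    # Sample one of the top-K heads proportionally to its score.
    probs = topk_scores / topk_scores.sum()
    chosen = topk_idx[torch.multinomial(probs, 1)].item()
    # Copy the chosen head's attention distribution to all heads.
    return attn_step[chosen].unsqueeze(0).expand_as(attn_step).clone()

# Example: 8 heads attending over 7 source tokens at one decoding step.
attn_step = torch.softmax(torch.randn(8, 7), dim=-1)
diverse_attn = sample_and_copy_head(attn_step, k=3)
```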
“…He et al. (2018) and Shen et al. (2019) introduced latent variables into the NMT model, so that the model can generate diverse outputs using different latent variables. Moreover, Sun et al. (2019) proposed to exploit the structural characteristics of the Transformer, using the different weights of the heads in the multi-head attention mechanism to obtain diverse results. Despite improvements in balancing accuracy and diversity, these methods do not represent diversity in the NMT model directly.…”
Section: Introduction
confidence: 99%