Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1653
Distilling Translations with Visual Awareness

Abstract: Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where images are only used by a second stage decoder. This approach is trained jointly to generate a good first draft translation and to improve over this draft by (i) making b…
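The abstract describes a two-stage architecture in which images are consulted only when refining a text-only draft. Below is a minimal PyTorch sketch of that translate-and-refine idea; the module names, dimensions, teacher-forced decoding, and the additive fusion of the visual context are illustrative assumptions, not the authors' implementation.

```python
# Sketch of translate-and-refine: stage 1 drafts a translation from text alone,
# stage 2 refines it while cross-attending to image region features.
import torch
import torch.nn as nn


class TranslateAndRefine(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stage 1: text-only encoder and draft decoder.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.draft_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # Stage 2: refinement decoder; visual features enter only here.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.refine_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, draft_ids, img_feats):
        # Teacher-forced for brevity; no causal mask is applied in this sketch.
        src = self.encoder(self.embed(src_ids))                 # (B, S, d)
        draft = self.draft_decoder(self.embed(draft_ids), src)  # (B, T, d)
        draft_logits = self.out(draft)
        # Refinement: attend from the draft states over projected image regions.
        img = self.img_proj(img_feats)                          # (B, R, d)
        visual_ctx, _ = self.visual_attn(draft, img, img)
        refined = self.refine_decoder(draft + visual_ctx, src)
        return draft_logits, self.out(refined)


# Toy usage with random tensors (36 pooled region features per image).
model = TranslateAndRefine(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))
draft = torch.randint(0, 1000, (2, 9))
imgs = torch.randn(2, 36, 2048)
draft_logits, refined_logits = model(src, draft, imgs)
print(draft_logits.shape, refined_logits.shape)
```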

Cited by 76 publications (82 citation statements) | References 33 publications
“…Apparently, how to fully exploit visual information is one of the core issues in multi-modal NMT, which directly impacts the model performance. To this end, a lot of effort has been made, roughly consisting of: (1) encoding each input image into a global feature vector, which can be used to initialize different components of multi-modal NMT models, or as additional source tokens (Huang et al., 2016), or to learn the joint multi-modal representation (Zhou et al., 2018; Calixto et al., 2019); (2) extracting object-based image features to initialize the model, supplement source sequences, or generate attention-based visual context (Huang et al., 2016; Ive et al., 2019); and (3) representing each image as spatial features, which can be exploited as extra context (Delbrouck and Dupont, 2017a; Ive et al., 2019) or as a supplement to source semantics (Delbrouck and Dupont, 2017b) via an attention mechanism.…”
Section: Introduction
confidence: 99%
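The statement above contrasts global image vectors with spatial features queried by attention. The following sketch illustrates, under assumed shapes and module names, two of those strategies: a pooled vector used to initialize a recurrent decoder state, and an attention step over a grid of region features. It is not taken from any of the cited models.

```python
# (1) Global image vector as decoder initialization vs. (3) spatial features via attention.
import torch
import torch.nn as nn

d_model, img_dim = 256, 2048

# (1) Project a single pooled CNN vector into the decoder's initial hidden state.
global_init = nn.Linear(img_dim, d_model)
pooled = torch.randn(2, img_dim)                       # one vector per image
h0 = torch.tanh(global_init(pooled)).unsqueeze(0)      # (1, B, d) e.g. an initial GRU state

# (3) Attend over an 8x8 grid of spatial features with the decoder state as query.
spatial_proj = nn.Linear(img_dim, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
grid = spatial_proj(torch.randn(2, 64, img_dim))       # (B, 64 regions, d)
query = h0.transpose(0, 1)                             # (B, 1, d) current decoder state
visual_context, weights = attn(query, grid, grid)      # per-step visual context
print(h0.shape, visual_context.shape)
```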
“…The former is quite similar to Arslan et al. (2018) and simply performs additive fusion, while the latter first applies the language attention, which produces the query vector for the subsequent visual attention. Ive et al. (2019) extend Libovický et al. (2018) to add a two-stage decoding process where visual features are only used in the second stage, through a visual cross-modal attention. They also experiment with another model where the attention is applied over the embeddings of object labels detected from the images.…”
Section: Visual Attention
confidence: 99%
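The last sentence mentions attending over the embeddings of detected object labels rather than raw visual features. A small sketch of that idea follows; the toy label vocabulary, the detector output, and the padding convention are made-up placeholders.

```python
# Attention over embeddings of object labels produced by a detector.
import torch
import torch.nn as nn

label_vocab = {"<pad>": 0, "man": 1, "dog": 2, "frisbee": 3, "grass": 4}
d_model = 256

label_embed = nn.Embedding(len(label_vocab), d_model, padding_idx=0)
label_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# Detector output for a batch of 2 images, padded to 3 labels each.
detected = torch.tensor([[1, 2, 3], [1, 4, 0]])
labels = label_embed(detected)                          # (B, L, d)
decoder_states = torch.randn(2, 9, d_model)             # (B, T, d) decoder states as queries
ctx, _ = label_attn(decoder_states, labels, labels,
                    key_padding_mask=(detected == 0))   # ignore padded label slots
print(ctx.shape)  # (2, 9, 256)
```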
“…For example, Calixto, Liu, and Campbell (2017) present a doubly-attentive decoder integrating two separate attention mechanisms over the source information, and a more effective hierarchical attention was proposed by Delbrouck and Dupont (2017). Ive, Madhyastha, and Specia (2019) propose an effective translate-and-refine framework, where visual features are only used by a second-stage decoder. Inspired by multi-task learning, Elliott and Kádár (2017) perform machine translation while constraining the averaged representations of the shared encoder to be the visual embedding of the paired image.…”
Section: Related Work
confidence: 99%
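The multi-task idea in the final sentence, pushing the averaged shared-encoder states towards the paired image embedding, can be expressed as an auxiliary loss. The margin ranking objective with in-batch negatives below is an assumption about one common way to implement such a constraint, not a reproduction of the cited model.

```python
# Auxiliary "imagination"-style loss: align mean encoder states with image embeddings.
import torch
import torch.nn.functional as F

def imagination_loss(enc_states, img_emb, margin=0.1):
    """enc_states: (B, S, d) shared-encoder outputs; img_emb: (B, d) image embeddings."""
    text = F.normalize(enc_states.mean(dim=1), dim=-1)   # averaged, L2-normalised
    img = F.normalize(img_emb, dim=-1)
    sims = text @ img.t()                                 # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                        # matching text-image pairs
    # Hinge: every mismatched image should score at least `margin` below the match.
    loss = torch.clamp(margin - pos + sims, min=0.0)
    loss = loss - torch.diag(loss.diag())                 # zero out the positive-pair terms
    return loss.mean()

enc = torch.randn(4, 12, 256)
img = torch.randn(4, 256)
print(imagination_loss(enc, img).item())
```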