Unraveling the Contribution of Image Captioning and Neural Machine Translation for Multimodal Machine Translation

Lala, Chiraag; Madhyastha, Pranava; Wang, Josiah; Specia, Lucia

doi:10.1515/pralin-2017-0020

Cited by 6 publications

(7 citation statements)

References 10 publications


“…Similarly to Lala et al (2017), our oracle experiments on the validation data showed that rescoring of the decoded beam of width 100 has the potential of improvement of up to 3 METEOR points. In the oracle experiment, we always chose a sentence with the highest sentence-level BLEU score.…”

Section: Beam Rescoring (mentioning)

confidence: 70%

CUNI System for the WMT17 Multimodal Translation Task

Helcl, Libovický

2017

Proceedings of the Second Conference on Machine Translation


In this paper, we describe our submissions to the WMT17 Multimodal Translation Task. For Task 1 (multimodal translation), our best scoring system is a purely textual neural translation of the source image caption to the target language. The main feature of the system is the use of additional data that was acquired by selecting similar sentences from parallel corpora and by data synthesis with back-translation. For Task 2 (cross-lingual image captioning), our best submitted system generates an English caption which is then translated by the best system used in Task 1. We also present negative results, which are based on ideas that we believe have potential of making improvements, but did not prove to be useful in our particular setup.
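The oracle experiment quoted above (picking, from the decoded beam, the hypothesis with the highest sentence-level BLEU against the reference) can be sketched roughly as follows. This is a minimal illustration, not the CUNI implementation: the whitespace tokenization, the add-one smoothing, and the function names `sentence_bleu` and `oracle_select` are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU on whitespace tokens.

    The smoothing scheme and tokenization are assumptions for this sketch.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


def oracle_select(beam, reference):
    """Pick the beam hypothesis with the highest sentence-level BLEU (hypothetical helper)."""
    return max(beam, key=lambda hyp: sentence_bleu(hyp, reference))


beam = [
    "a man rides a horse",
    "a man is riding a bike",
    "a man is riding a horse",
]
best = oracle_select(beam, "a man is riding a horse")
```

Because the oracle consults the reference, it only gives an upper bound on what rescoring the beam could achieve; a real rescorer has no access to the reference at test time.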


“…Focusing on MMT, Lala et al (2017) show that, given reliable image information in the form of captions, an ideal MMT system would be able to significantly benefit and obtain better translations. Vinyals et al (2016) and Karpathy et al (2016) present an analysis of lexical and syntactic properties of the generated captions.…”

Section: Studying Visual Representations (mentioning)

confidence: 99%

“…Focusing on MMT, Lala et al (2017) show that, given reliable image information in the form of captions, an ideal MMT system would be able to significantly benefit and obtain better translations.…”

Section: Background and Related Work (mentioning)

confidence: 99%

The role of image representations in vision to language tasks

Madhyastha, Wang, Specia

2018

Nat. Lang. Eng.

Self Cite


Tasks that require modeling of both language and visual information such as image captioning have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision to language problems: the task of image captioning and the task of multimodal machine translation. Our analysis provides interesting insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit the subspace to generate captions.


“…Multimodal content is gaining popularity in machine translation (MT) community due to its appealing chances to improve translation quality and its usage in commercial applications such as image caption translation for online news articles or machine translation for e-commerce product listings [1,2,3,4]. Although the general performance of neural machine translation (NMT) models is very good given large amounts of parallel texts, some inputs can remain genuinely ambiguous, especially if the input context is limited.…”

Section: Introduction (mentioning)

confidence: 99%

Hindi Visual Genome: A Dataset for Multi-Modal English to Hindi Machine Translation

Parida, Bojar, Dash

2019

CyS


Visual Genome is a dataset connecting structured image information with English language. We present "Hindi Visual Genome", a multimodal dataset consisting of text and images suitable for English-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing which took the associated images into account. We prepared a set of 31525 segments, accompanied by a challenge test set of 1400 segments. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Our dataset is the first for multimodal English-Hindi machine translation, freely available for noncommercial research purposes. Our Hindi version of Visual Genome also allows to create Hindi image labelers or other practical tools. Hindi Visual Genome also serves in Workshop on Asian Translation (WAT) 2019 Multi-Modal Translation Task.

