DCU-UvA Multimodal MT System Report

Calixto, Iacer; Elliott, Desmond; Frank, Stella

doi:10.18653/v1/w16-2359

Cited by 38 publications

(45 citation statements)

References 5 publications


“…Most existing work obtain the best results by combining the penultimate layer of the CNN (via concatenation, summation, etc.) with the final state of the source sentence representation and using it to initialize the target RNN (Caglayan et al, 2016; Calixto et al, 2016; Huang et al, 2016). Recent work also explores an attention mechanism where they use lower level CNN features of the images, such as a convolutional layer, and condition the source and the target sentences on the image features (Calixto et al, 2016; Calixto et al, 2017).…”

Section: Multimodal Machine Translation Approaches (mentioning)

confidence: 99%

“…with the final state of the source sentence representation and using it to initialize the target RNN (Caglayan et al, 2016; Calixto et al, 2016; Huang et al, 2016). Recent work also explores an attention mechanism where they use lower level CNN features of the images, such as a convolutional layer, and condition the source and the target sentences on the image features (Calixto et al, 2016; Calixto et al, 2017). The intuition here is that the lower-level CNN features capture information about different areas of the images and an attention mechanism could learn to attend to specific regions while both encoding the source and decoding the target sentence.…”

Section: Multimodal Machine Translation Approaches (mentioning)

confidence: 99%


The role of image representations in vision to language tasks

Madhyastha, Wang, Specia

2018

Nat. Lang. Eng.


Tasks that require modeling of both language and visual information such as image captioning have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision to language problems: the task of image captioning and the task of multimodal machine translation. Our analysis provides interesting insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit the subspace to generate captions.


“…We then briefly discuss the doubly-attentive multi-modal NMT model we use in our experiments (§2.3), which is comparable to the model evaluated by Calixto et al (2016) and further detailed and analysed in Calixto et al (2017a).…”

Section: MT Models Evaluated In This Work (mentioning)

confidence: 99%

Proceedings of the Sixth Workshop on Vision and Language

2017


“…Multimodal NMT systems have been introduced (Elliott et al, 2015; Caglayan et al, 2016; Calixto et al, 2016; Huang et al, 2016) to incorporate visual information into NMT approaches, most of which condition the NMT on an image representation (typi-…”

Section: Introduction (mentioning)

confidence: 99%

“…They also incorporate attention mechanisms (Calixto et al, 2016). However, the effect of image features or the efficacy of the representational contribution is still an open research question.…”

Section: Introduction (mentioning)

confidence: 99%

Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation

Madhyastha, Wang, Specia

2017

Proceedings of the Second Conference on Machine Translation


This paper describes the University of Sheffield's submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation approach. We investigate the effect of replacing the commonly-used image embeddings with an estimated posterior probability prediction for 1,000 object categories in the images.
