2021
DOI: 10.1016/j.ins.2020.11.024

Multi-modal neural machine translation with deep semantic interactions

Cited by 28 publications (13 citation statements)
References 17 publications
“…However, they mainly focus on textual tasks and cannot effectively handle multi-modal tasks such as image-text retrieval, image captioning, multimodal machine translation (Lin et al., 2020a; Su et al., 2021) and visual dialog (Murahari et al., 2020).…”
Section: Text Enhance Vision
confidence: 99%
“…Both types of features have been used in various vision-and-language tasks such as multimodal dialogue sentiment analysis (Firdaus et al., 2020), image captioning (Xu et al., 2015; Shi et al., 2021), and multimodal machine translation (Ive et al., 2019; Lin et al., 2020; Su et al., 2021).…”
Section: Image Features
confidence: 99%
“…This is done to learn bidirectional multi-modal translation simultaneously. Moreover, Su et al. (2021) showed that jointly learning text-image interactions, rather than modeling the two modalities separately with attentional networks, is more useful. This result is in line with several state-of-the-art visual Transformer models, such as VisualBERT (Li et al., 2019) and UNITER (Chen et al., 2019).…”
Section: Related Work
confidence: 99%