Multi-modal machine translation (MMT) aims to augment linguistic machine translation frameworks by incorporating aligned visual information. The core research challenge for MMT remains how to fuse the image information and further align it with the bilingual data. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of one side of the text with the given image. In this work, we explore a triplet alignment among the source text, the target text, and the image instance. In particular, we propose the Multi-aspect AlignmenT (MAT) model, which extends the MMT task into three sub-tasks, namely cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. At the core of this model is a hybrid vocabulary that compiles the visually depictable entities (nouns) occurring on both sides of the text as well as the detected object labels appearing in the images. Through this sub-task, we postulate that MAT aligns the modalities more closely than previously proposed methods by casting the three instances into a shared domain. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets of the MMT task.
CCS CONCEPTS
• Computer systems organization → Machine translation; Multi-modal alignment.