“…Apparently, how to fully exploit visual information is one of the core issues in multi-modal NMT, which directly impacts model performance. To this end, many efforts have been made, roughly falling into three categories: (1) encoding each input image into a global feature vector, which can be used to initialize different components of multi-modal NMT models, serve as an additional source token (Huang et al., 2016), or learn a joint multi-modal representation (Zhou et al., 2018; Calixto et al., 2019); (2) extracting object-based image features to initialize the model, supplement the source sequence, or generate attention-based visual context (Huang et al., 2016; Ive et al., 2019); and (3) representing each image as spatial features, which can be exploited as extra context (Delbrouck and Dupont, 2017a; Ive et al., 2019) or as a supplement to source semantics (Delbrouck and Dupont, 2017b) via an attention mechanism.…”
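
To make the third strategy concrete, here is a minimal, hedged sketch of attention over spatial image features: the decoder state acts as the query and attends over flattened CNN region features to produce a visual context vector that can supplement the textual source context. The class name `VisualAttention`, the dimensions, and the single-head dot-product formulation are illustrative assumptions, not the exact formulation used in the cited works.

```python
# A minimal sketch (assumptions: PyTorch, single-head scaled dot-product
# attention, hypothetical dimensions) of approach (3): the decoder state
# attends over spatial image features to obtain a visual context vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAttention(nn.Module):
    """Attend over spatial image features (e.g. a 7x7 CNN feature map
    flattened to 49 regions) using the current decoder state as the query."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.key_proj = nn.Linear(feat_dim, hidden_dim)
        self.value_proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, decoder_state, spatial_feats):
        # decoder_state: (batch, hidden_dim)
        # spatial_feats: (batch, num_regions, feat_dim)
        q = self.query_proj(decoder_state).unsqueeze(1)       # (batch, 1, hidden)
        k = self.key_proj(spatial_feats)                      # (batch, regions, hidden)
        v = self.value_proj(spatial_feats)                    # (batch, regions, hidden)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)                   # (batch, 1, regions)
        visual_context = torch.bmm(weights, v).squeeze(1)     # (batch, hidden)
        return visual_context, weights.squeeze(1)


if __name__ == "__main__":
    attn = VisualAttention(feat_dim=2048, hidden_dim=512)
    h_t = torch.randn(2, 512)        # decoder hidden state
    img = torch.randn(2, 49, 2048)   # e.g. flattened 7x7 ResNet feature map
    ctx, w = attn(h_t, img)
    print(ctx.shape, w.shape)        # torch.Size([2, 512]) torch.Size([2, 49])
```

The resulting `visual_context` would typically be concatenated with, or gated against, the textual attention context before the decoder predicts the next target word; how the two contexts are fused is exactly where the surveyed approaches differ.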