Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.438

On Vision Features in Multimodal Machine Translation

Abstract: Previous work on multimodal machine translation (MMT) has focused on ways of incorporating vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the p…

Cited by 22 publications (26 citation statements); references 8 publications.
“…Gated fusion techniques are widely used to combine the representations from different modalities, as is done in some previous works [9]. In this method, for any input sample consisting of an image I, source text S, and target text T, the image features are obtained from OpenAI CLIP's Vision Transformer (ViT) model [14] as ViT(I), and the textual embeddings are obtained from the standard Transformer encoder as H_S.…”
Section: Gated Fusion Methods
confidence: 99%
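The gated fusion described in the statement above can be illustrated with a minimal sketch. The scalar gate, the weight vector `w`, and the bias `b` below are illustrative assumptions, not the cited papers' exact parameterization: in practice the gate is learned jointly with the translation model, and the text and image features come from a Transformer encoder and CLIP's ViT rather than random vectors.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_text, h_img, w, b):
    """Fuse a text state and an image feature with a scalar gate.

    lam    = sigmoid(w . [h_text; h_img] + b)  -- how much vision to admit
    h_fuse = h_text + lam * h_img              -- gated residual combination
    """
    concat = h_text + h_img  # list concatenation plays the role of [h_text; h_img]
    lam = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    fused = [t + lam * v for t, v in zip(h_text, h_img)]
    return fused, lam

d = 4  # toy feature dimension (real models use hundreds of dimensions)
h_text = [random.uniform(-1, 1) for _ in range(d)]  # stand-in for H_S
h_img = [random.uniform(-1, 1) for _ in range(d)]   # stand-in for ViT(I)
w = [random.uniform(-0.1, 0.1) for _ in range(2 * d)]
fused, lam = gated_fusion(h_text, h_img, w, 0.0)
print(lam, fused)
```

Because the gate is a sigmoid, `lam` always lies in (0, 1), so the model can smoothly interpolate between ignoring the image (`lam` near 0) and fully adding it (`lam` near 1).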
“…This method was implemented based on Multimodal Machine Translation, where the sigmoid gate function was replaced with tanh. All parameters were kept constant as in [9], except for the learning rate, which was changed to 0.001, and the maximum number of updates, which was set to 800,000. For evaluation, the average of the last 10 checkpoints was used for more reliable results.…”
Section: Gated Fusion Methods
confidence: 99%
“…Then, we apply the gated fusion mechanism (Zhang et al., 2020; Wu et al., 2021; Li et al., 2022a) to fuse H_language and H_vision. The fused output H_fuse ∈ R^{n×d} is obtained by:…”
Section: Model Architecture
confidence: 99%
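The equation in this statement is truncated in the excerpt. A common form of the gated fusion used in this line of work (a sketch of the standard formulation, not necessarily the exact equation of the cited paper; W is an assumed learned projection over the concatenated features) is:

```latex
\lambda = \operatorname{sigmoid}\!\bigl(W\,[H_{\text{language}};\, H_{\text{vision}}]\bigr), \qquad
H_{\text{fuse}} = (1 - \lambda)\, H_{\text{language}} + \lambda\, H_{\text{vision}}
```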