Proceedings of the Third Workshop on Multimodal Artificial Intelligence 2021
DOI: 10.18653/v1/2021.maiworkshop-1.11
Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA

Abstract: Recent vision-language understanding approaches adopt a multi-modal transformer pretraining and finetuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures the alignment solely based on indirect signals. In this work, we propose to enhance the alignment mechanism by incorporating image scene graph structures as the bridge between the two modalities, and learning with new contrastive objectives. In our preliminary study on the challenging…
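The paper's exact contrastive objectives are not shown in this truncated abstract. As a rough illustration of what "learning with contrastive objectives" over aligned modalities typically means, below is a minimal sketch of a symmetric InfoNCE-style loss between text embeddings and scene-graph embeddings; the function name, the temperature value, and the use of pooled per-example embeddings are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    text_emb, graph_emb: (N, D) arrays where row i of each matrix is a
    matched text / scene-graph pair. Matched pairs are pulled together,
    mismatched pairs within the batch are pushed apart.
    """
    # L2-normalize so the dot product is a cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature        # (N, N); positives on the diagonal
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the text->graph and graph->text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned pairs the loss approaches zero, while shuffled (mismatched) pairs yield a large loss, which is the gradient signal that drives the two modalities toward a shared embedding space.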

Cited by 2 publications (1 citation statement)
References 13 publications
“…Depthwise separable convolution is a type of convolution that enables the model to learn local relationships. The Semantic Aligned Multi-modal Transformer (12) is utilized to enhance the alignment mechanism by incorporating image scene graph structures as a bridge between vision and language. (13) proposed MMFT-BERT.…”
Section: Transformer Based Approaches For Visual Question Answering
confidence: 99%