Image and Signal Processing for Remote Sensing XXVIII 2022
DOI: 10.1117/12.2636276

Multi-modal fusion transformer for visual question answering in remote sensing

Abstract: With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current f…
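The abstract's point that modality fusion drives VQA performance can be made concrete with a minimal sketch: project image region features and question token embeddings into a shared space, concatenate them into one token sequence, and let a transformer encoder attend across both modalities. Everything below (module name, dimensions, the random tensors standing in for CNN/BERT features) is an illustrative assumption, not the paper's actual architecture.

```python
# Minimal multimodal fusion sketch: shared projection + joint self-attention.
import torch
import torch.nn as nn

class FusionTransformerSketch(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)  # map region features to shared space
        self.txt_proj = nn.Linear(txt_dim, d_model)  # map token embeddings to shared space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, n_regions, img_dim); txt_feats: (B, n_tokens, txt_dim)
        fused = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1)
        return self.encoder(fused)  # joint representation over both modalities

# Random features stand in for CNN region features and BERT token embeddings.
out = FusionTransformerSketch()(torch.randn(2, 36, 2048), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 56, 512])
```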

Cited by 11 publications (6 citation statements)
References 14 publications
“…In CV, transformers have demonstrated advantages in processing multimodal data due to their more general and flexible modeling space [74]. Consequently, researchers started employing transformers to address multimodal problems in RS image-text retrieval [75] and RS visual question answering [76]. Currently, large-scale multimodal datasets are scarce, so researchers often need to collect multimodal data themselves.…”
Section: Challenges
Mentioning confidence: 99%
“…These prompts were then input into a language model for answer prediction. Siebert et al. [21] employed the VisualBERT [7] model to better learn joint representations.…”
Section: Remote Sensing VQA
Mentioning confidence: 99%
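For context, here is a hedged sketch of how VisualBERT is typically used for joint image-text representation via the HuggingFace transformers library. The checkpoint is a public pretrained VisualBERT; the random 2048-dimensional tensors stand in for the 36 region features a detector or CNN would extract, and the question string is only an example.

```python
# Joint image-text encoding with a pretrained VisualBERT checkpoint.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("How many buildings are in the image?", return_tensors="pt")
visual_embeds = torch.randn(1, 36, 2048)  # placeholder region features
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": torch.ones(1, 36, dtype=torch.long),
    "visual_attention_mask": torch.ones(1, 36),
})

outputs = model(**inputs)
joint = outputs.last_hidden_state  # one sequence spanning text and visual tokens
```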
“…It is worth noting that Bazi et al. [19] introduced an encoder-decoder structure, increasing the model's complexity, while Siebert et al. [21] performed full fine-tuning on VisualBERT [7], requiring substantial computational resources and runtime. Compared to these existing transformer-based methods, RSAdapter achieves efficient fine-tuning without increasing model complexity, saving training time and computational resources.…”
Section: Remote Sensing VQA
Mentioning confidence: 99%
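RSAdapter's exact design is not reproduced here; the following is a generic bottleneck-adapter sketch of the parameter-efficient idea the statement contrasts with full fine-tuning: freeze the pretrained weights and train only small residual projection layers inserted around each block. All names and dimensions are assumptions.

```python
# Generic bottleneck adapter (assumed names/dims; not the actual RSAdapter code).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # compress
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)    # expand back

    def forward(self, hidden):
        # Residual connection keeps the frozen layer's output as the baseline.
        return hidden + self.up(self.act(self.down(hidden)))

class AdaptedLayer(nn.Module):
    """Wrap a pretrained layer: freeze it, train only the adapter."""
    def __init__(self, layer: nn.Module, d_model=768):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.adapter = BottleneckAdapter(d_model)

    def forward(self, x):
        return self.adapter(self.layer(x))

# Usage: wrap a standard transformer encoder layer.
base = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
adapted = AdaptedLayer(base)
out = adapted(torch.randn(2, 16, 768))  # (batch, tokens, d_model)
```

Only the adapter's down- and up-projections receive gradients, which is what makes this family of methods cheap to train relative to full fine-tuning.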
“…The results indicate that more complex fusion strategies yield higher accuracies. More recently, Siebert et al. [37] proposed a VQA model that uses a multi-modal fusion module based on VisualBERT to integrate the image and language modalities.…”
Section: A. VQA
Mentioning confidence: 99%
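Models of this kind commonly cast answer prediction as classification over a fixed answer set on top of the fused representation (e.g., VisualBERT's pooled output). The sketch below illustrates such an answer head; the hidden size and answer-vocabulary size are placeholder assumptions, since each RS VQA dataset defines its own answer set.

```python
# Minimal answer-classification head over a fused image-question vector.
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, d_model=768, num_answers=100):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.Tanh(),
            nn.Linear(d_model, num_answers),
        )

    def forward(self, pooled):
        # pooled: (batch, d_model) fused representation per image-question pair
        return self.classifier(pooled)  # logits over candidate answers

logits = AnswerHead()(torch.randn(4, 768))
predicted = logits.argmax(dim=-1)  # index of the most likely answer
```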