2022
DOI: 10.1109/tgrs.2022.3173811
From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data

Cited by 28 publications (21 citation statements)
References 52 publications
“…For instance, AOA [33] handles the commonly seen questions better than the others (e.g., Number and Yes/No are classical questions for natural images), while our method performs better on the positioning-related questions (e.g., Size, Location, Shape). (3) CNN models [3,25,43,47] tend to perform better than transformers [33,45,50] on Color questions, with a gap of nearly 25%. On the other hand, these questions are always about information on isolated objects (e.g., small vehicle, large vehicle, ship), which are significantly smaller than the others.…”
Section: Benchmark Results
confidence: 99%
“…We can see significant improvements with our proposed GFTransformer. Specifically, compared to the state-of-the-art VQA model designed for aerial images [47], boosts of 2.45% overall accuracy and 2.70% average accuracy are obtained on Test-Phila. In addition, we generally perform better than the models designed for natural scenes [3,29,33,43,50], with improvements of 1.53% overall accuracy and 1.74% average accuracy.…”
Section: Results On RSVQA
confidence: 99%
“…If we can provide the answer to the above question in natural language instead of maps, users will understand it more easily. In this context, tasks that combine remote sensing images and natural language, such as image captioning [8,9] and visual question answering (VQA) [10,11], currently attract a lot of attention in the remote sensing community.…”
Section: Introduction
confidence: 99%
“…Recently, a multilayer aggregated Transformer was utilized to extract information for caption generation [30]. Regarding VQA for remote sensing data (RSVQA), Lobry et al [31] first introduced this task, built two datasets, and used a hybrid CNN-RNN model to extract features, while Yuan et al [32] proposed a self-paced curriculum learning-based model that is trained gradually from easy to hard questions.…”
confidence: 99%
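The self-paced, easy-to-hard training scheme mentioned in the last excerpt can be illustrated with a minimal sketch. This is not the cited authors' implementation: the hard-threshold weighting, the threshold growth schedule, and the model/loader interfaces (model, loader, lam, growth_rate) are assumptions for illustration only.

```python
import torch

def self_paced_weights(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """Hard self-paced weighting: keep samples whose loss is below the
    threshold lam (treated as 'easy'); drop the rest for this step."""
    return (losses.detach() < lam).float()

def train_epoch(model, loader, optimizer, criterion, lam):
    model.train()
    for images, questions, answers in loader:
        logits = model(images, questions)
        # per-sample losses (reduction="none") so each question can be weighted
        losses = criterion(logits, answers)          # shape: (batch,)
        weights = self_paced_weights(losses, lam)    # 1 = easy, 0 = skipped (hard)
        loss = (weights * losses).sum() / weights.sum().clamp(min=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Assumed schedule: lam is raised each epoch so harder questions are gradually included.
# criterion = torch.nn.CrossEntropyLoss(reduction="none")
# for epoch in range(num_epochs):
#     train_epoch(model, loader, optimizer, criterion, lam)
#     lam *= growth_rate   # e.g., 1.1
```

In such schemes, lam starts small so only low-loss (easy) questions contribute to the gradient, and it is increased over training until all questions are included.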