2020
DOI: 10.3390/rs12060939

A Multi-Level Attention Model for Remote Sensing Image Captions

Abstract: The task of image captioning involves generating a sentence that appropriately describes an image, a problem at the intersection of computer vision and natural language processing. Although research on remote sensing image captioning has only recently begun, it is of great significance. The attention mechanism, inspired by the way humans think, is widely used in remote sensing image captioning tasks. However, the attention mechanism currently used in this task is mainly aimed at images, which is too simple to express…

Cited by 38 publications (23 citation statements); references 33 publications.
“…The current state-of-the-art results for remote sensing image captioning were reported by Li et al. [3], leveraging a novel multi-level attention mechanism that uses three attention structures: one focusing on the different image regions; another focusing on the previously generated words; and a separate one deciding whether to attend to the image information or the caption information. In addition, these authors have used and released modified versions of the three aforementioned datasets, correcting a variety of errors in the textual descriptions associated with the images.…”
Section: Related Work (mentioning)
confidence: 99%
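To make the mechanism described in this statement concrete, the following is a minimal sketch, not the authors' released implementation; all module names, layer shapes, and dimensions are illustrative assumptions. It shows the three attention structures the statement lists: one over image region features, one over embeddings of previously generated words, and a gate deciding whether to rely on the visual or the textual context.

```python
# Minimal sketch (illustrative assumptions, not the code of Li et al. [3]).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttention(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, attn_dim):
        super().__init__()
        # Attention over image region features.
        self.vis_feat = nn.Linear(feat_dim, attn_dim)
        self.vis_hid = nn.Linear(hidden_dim, attn_dim)
        self.vis_score = nn.Linear(attn_dim, 1)
        # Attention over embeddings of previously generated words.
        self.txt_feat = nn.Linear(embed_dim, attn_dim)
        self.txt_hid = nn.Linear(hidden_dim, attn_dim)
        self.txt_score = nn.Linear(attn_dim, 1)
        # Gate deciding how much to attend to image vs. caption information.
        self.gate = nn.Linear(hidden_dim, 1)
        self.txt_proj = nn.Linear(embed_dim, feat_dim)

    def forward(self, regions, prev_words, hidden):
        # regions: (B, R, feat_dim); prev_words: (B, T, embed_dim); hidden: (B, hidden_dim)
        a_v = self.vis_score(torch.tanh(
            self.vis_feat(regions) + self.vis_hid(hidden).unsqueeze(1))).squeeze(-1)
        v_ctx = (F.softmax(a_v, dim=1).unsqueeze(-1) * regions).sum(1)
        a_t = self.txt_score(torch.tanh(
            self.txt_feat(prev_words) + self.txt_hid(hidden).unsqueeze(1))).squeeze(-1)
        t_ctx = (F.softmax(a_t, dim=1).unsqueeze(-1) * prev_words).sum(1)
        # beta near 1 attends to the image, near 0 to the caption history.
        beta = torch.sigmoid(self.gate(hidden))
        return beta * v_ctx + (1 - beta) * self.txt_proj(t_ctx)
```

The gate plays the role of the "separate" third attention structure: at each decoding step it produces a scalar that mixes the two attended contexts before the next word is predicted.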
“…In our work, we also assessed model performance using UCM and RSICD, thus considering the largest dataset currently available in the area, as well as a smaller dataset that allows us to see how performance varies as a function of the available amount of training data. It should nonetheless be noted that the values in Table 1 are not all directly comparable, because different studies used either the original or the updated versions of the datasets from Li et al. [3], or because of small differences in the computation of the evaluation metrics (e.g., different estimates for the IDF term weight component in metrics such as CIDEr).…”
Section: Related Work (mentioning)
confidence: 99%
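The IDF caveat in this statement can be illustrated with a short sketch, assuming the standard CIDEr formulation in which each n-gram is weighted by the log of the inverse document frequency over images; function and variable names here are illustrative.

```python
# Sketch of CIDEr's IDF table, to show why estimates differ across papers:
# the weight of each n-gram depends on which corpus the document frequency
# is computed from (e.g., the test split vs. the whole dataset).
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def idf_table(reference_sets, n):
    # reference_sets: one list of reference captions per image.
    # Document frequency counts images whose references contain the n-gram.
    num_images = len(reference_sets)
    df = Counter()
    for refs in reference_sets:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)
    return {g: math.log(num_images / df[g]) for g in df}
```

Building this table on a different reference corpus changes every n-gram weight, and hence the final CIDEr score, even when the generated captions are identical.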
“…Residual attention [20] has been integrated into the training of a stacked deep neural network to improve classification accuracy. Several attention-based image retrieval approaches have been explored in different studies [18,21–24,36–39].…”
Section: Satellite Image Retrieval (mentioning)
confidence: 99%
“…Then a hashing layer is appended to learn binary hash codes. Different loss functions [18,21–24,36–39] have been explored, including point-wise similarity, pair-wise similarity [37,39], and triplet-wise supervision [38], to learn binary hash codes. In [26], different losses were considered for training generic features.…”
Section: Deep Hashing (mentioning)
confidence: 99%
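As a concrete illustration of the triplet-wise supervision this statement mentions, the following is a minimal sketch under stated assumptions, not code from any of the cited works: continuous hash-like codes are trained so that an anchor lies closer to a same-class positive than to a negative, with a quantization term pulling outputs toward binary values.

```python
# Sketch of triplet-wise supervision for learning binary hash codes
# (names, margin, and weighting are illustrative assumptions).
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=2.0):
    """anchor/positive/negative: (B, code_len) tanh outputs of a hashing layer."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    # Hinge on the distance gap between the positive and negative pairs.
    hinge = F.relu(d_pos - d_neg + margin).mean()
    # Quantization term pulls code entries toward ±1 before binarization.
    quant = (anchor.abs() - 1).pow(2).mean()
    return hinge + 0.1 * quant
```

Point-wise and pair-wise losses differ only in how supervision is formed (per-sample labels or pairwise similarity indicators, respectively), while the binarization step at retrieval time typically takes the sign of the learned codes.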