Proceedings of the 29th International Conference on Advances in Geographic Information Systems 2021
DOI: 10.1145/3474717.3483631

Remote Sensing Image Captioning with Continuous Output Neural Models

Cited by 5 publications (2 citation statements); references 1 publication.
“…Retrieved context or exemplars come from large-scale unlabeled collections or labeled sets for supervised learning [73,38]. Computer vision and multimodal models also benefit from retrieval augmentation [41,51]. A focus in NLP has been methods to adapt a pretrained parametric language model into one that can integrate information from retrieved sequences, and joint training of a retriever and generation model.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
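
The retrieval-augmentation idea in the excerpt above can be made concrete with a small sketch: embed the query image, fetch the most similar labeled exemplars by cosine similarity, and prepend their captions as context for the generator. Everything below (the retrieve_exemplars helper, the toy 4-dimensional embeddings, the prompt format) is an illustrative assumption, not the implementation used in the cited works [73,38,41,51].

```python
import numpy as np

def retrieve_exemplars(query_embedding, corpus_embeddings, corpus_captions, k=3):
    """Return the captions of the k corpus items most similar to the query,
    using cosine similarity between L2-normalised embeddings."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity of query vs. each item
    top = np.argsort(-scores)[:k]         # indices of the k nearest neighbours
    return [corpus_captions[i] for i in top]

# Toy usage: 4-dimensional random vectors stand in for real image/text features.
rng = np.random.default_rng(0)
corpus_emb = rng.random((5, 4))
corpus_cap = [f"caption {i}" for i in range(5)]
query_emb = rng.random(4)

exemplars = retrieve_exemplars(query_emb, corpus_emb, corpus_cap, k=2)
prompt = " ".join(exemplars) + " ->"      # retrieved exemplars prepended as context
print(prompt)
```

In a jointly trained system, the retriever's embedding function and the generator would be optimised together; here the retrieval step is kept separate purely for clarity.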
“…ClipCap [113] leverages this collaboration without re-training CLIP or GPT-2; instead, a lightweight transformer-based mapping module is trained to match CLIP representations to GPT-2, which eventually generates the caption c. The CLIP-GPT-2 combination was also followed in VC-GPT [114]. Another lightweight improvement, combining a pre-trained CLIP visual encoder and a frozen GPT-2 text decoder, further boosts the performance of low-resource approaches [115]. A cross-modal filter that selects the most relevant visual information, so that captioning errors are reduced, is proposed in [116], still respecting the frozen CLIP-GPT-2 framework.…”
Section: Image Captioning (IC)
Citation type: mentioning
confidence: 99%
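
The frozen-backbone recipe described in this excerpt can be illustrated with a minimal sketch: a small trainable module maps a CLIP image embedding to a fixed-length prefix in the GPT-2 embedding space, while CLIP and GPT-2 themselves stay frozen. The class name ClipPrefixMapper, the dimensions (512 for a CLIP ViT-B/32 image embedding, 768 for GPT-2 base) and the hyperparameters are assumptions for illustration, not the exact ClipCap [113] configuration, and the frozen GPT-2 decoding step itself is omitted.

```python
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    """Maps one CLIP image embedding to a sequence of prefix vectors in the
    GPT-2 embedding space; a frozen GPT-2 would then decode the caption from
    this prefix as if it were token embeddings."""

    def __init__(self, clip_dim=512, gpt2_dim=768, prefix_len=10,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt2_dim = gpt2_dim
        # Expand the single CLIP vector into prefix_len vectors of GPT-2 width.
        self.proj = nn.Linear(clip_dim, prefix_len * gpt2_dim)
        layer = nn.TransformerEncoderLayer(d_model=gpt2_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip_embedding):                       # (batch, clip_dim)
        prefix = self.proj(clip_embedding)                   # (batch, prefix_len * gpt2_dim)
        prefix = prefix.view(-1, self.prefix_len, self.gpt2_dim)
        return self.transformer(prefix)                      # (batch, prefix_len, gpt2_dim)

mapper = ClipPrefixMapper()
dummy_clip_features = torch.randn(2, 512)   # stand-in for frozen CLIP image features
prefix = mapper(dummy_clip_features)
print(prefix.shape)                         # torch.Size([2, 10, 768])
```

Because only the mapping module is trained, the approach remains lightweight, which is the property the low-resource variants [115] build on.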