2022
DOI: 10.1109/jstars.2022.3215803
Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval

Abstract: Cross-modal text-image retrieval in remote sensing (RS) provides a flexible retrieval experience for mining useful information from RS repositories. However, existing methods are designed to accept queries formulated in the English language only, which may restrict access to useful information for non-English speakers. Allowing multilanguage queries can enhance communication with the retrieval system and broaden access to RS information. To address this limitation, this article proposes a multilanguage…

Cited by 28 publications (4 citation statements)
References 43 publications
“…The Google Machine Translation team first proposed the Transformer model [19]. This model demonstrates strong sequential data modeling capabilities in the field of natural language processing.…”
Section: Methodology of Informer
Mentioning confidence: 99%
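
As a minimal sketch of the sequence-modeling capability this statement refers to, the following applies PyTorch's built-in Transformer encoder to a batch of already-embedded sequences; all sizes are illustrative, not tied to any cited model.

```python
# Minimal sketch: a Transformer encoder over embedded sequences,
# in the spirit of Vaswani et al. [19]. All sizes are illustrative.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 128, 4, 2

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

x = torch.randn(8, 32, d_model)  # 8 sequences of 32 tokens, already embedded
out = encoder(x)                 # self-attention mixes information across positions
print(out.shape)                 # torch.Size([8, 32, 128])
```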
“…Furthermore, the realm of information retrieval for textual data has been extensively explored, resulting in the development of numerous text-oriented search engines [7]. AlRahhal et al. [8] introduced a multilanguage framework centered around text transformers. This framework consists of two transformer encoders designed to learn representations specific to different modalities.…”
Section: Related Literature
Mentioning confidence: 99%
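
A minimal sketch of the dual-encoder idea described in this statement: two independent encoders map text queries and images into a shared embedding space, and retrieval ranks gallery images by cosine similarity. The module names and dimensions below are illustrative placeholders, not the architecture of [8].

```python
# Sketch of a dual-encoder cross-modal retrieval setup: two independent
# encoders project text and images into a shared space; retrieval ranks
# images by cosine similarity to the query. All names/sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(d_model, embed_dim)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)  # stand-in for a CNN/ViT backbone

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

text_enc, image_enc = TextEncoder(), ImageEncoder()
query = text_enc(torch.randint(0, 30000, (1, 16)))  # one tokenized query
gallery = image_enc(torch.randn(100, 512))           # 100 precomputed image features
scores = query @ gallery.T                           # cosine similarities, shape (1, 100)
top5 = scores.topk(5).indices                        # indices of best-matching images
```

In practice the image branch would be a full CNN or ViT backbone; here it is reduced to a linear projection over precomputed features to keep the sketch self-contained.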
“…Our study considers four distinct directions: horizontal, vertical, diagonal, and anti-diagonal. To quantify the run lengths, we use a logarithmic scale for quantization, which is structured as follows: [1], [2], [3-4], [5-8], [9-16], [17-32], [33-64], [65-128], [129-]. In the context of a binary image, run lengths can be computed separately for both the background and foreground.…”
Section: Global Image Features-Based DIR
Mentioning confidence: 99%
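
A small sketch of this scheme, assuming horizontal runs over a binary NumPy image; the bin edges follow the logarithmic scale quoted above, and the other three directions can be handled analogously (e.g., by transposing or slicing diagonals).

```python
# Run-length extraction along one direction of a binary image, followed
# by logarithmic quantization into the 9 bins quoted above.
import numpy as np

# Upper bin edges: 1, 2, 4, 8, ..., 128; anything >= 129 falls in the open last bin.
BIN_EDGES = [1, 2, 4, 8, 16, 32, 64, 128]

def quantize_run(length):
    """Map a run length to its logarithmic bin index."""
    for i, edge in enumerate(BIN_EDGES):
        if length <= edge:
            return i
    return len(BIN_EDGES)  # the open-ended [129-] bin

def horizontal_runs(img, value):
    """Run lengths of pixels equal to `value`, computed row by row."""
    runs = []
    for row in img:
        count = 0
        for px in row:
            if px == value:
                count += 1
            elif count:
                runs.append(count)
                count = 0
        if count:
            runs.append(count)
    return runs

img = (np.random.rand(64, 64) > 0.5).astype(np.uint8)  # toy binary image
fg_hist = np.bincount(
    [quantize_run(r) for r in horizontal_runs(img, 1)],
    minlength=len(BIN_EDGES) + 1,
)  # 9-bin histogram for foreground runs; repeat with value=0 for background
```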
“…One of the key advantages of CLIP is its ability to generalize to new tasks and classes without the need for additional training, making it powerful and flexible enough to perform zero-shot learning. It is worth recalling that these models have been successfully applied to RS tasks related to RS image-text retrieval [3] and visual question answering [4,5].…”
Section: Introduction
Mentioning confidence: 99%
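
A minimal sketch of the zero-shot matching this statement describes, using a public CLIP checkpoint via Hugging Face Transformers; the prompts and file name are hypothetical, and this is generic CLIP usage rather than the RS-specific pipelines of [3]-[5].

```python
# Zero-shot text-image matching with a public CLIP checkpoint:
# no task-specific training, just prompt-vs-image similarity scores.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.png")  # hypothetical RS image file
texts = ["an aerial photo of an airport", "an aerial photo of a forest"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape: (1, num_texts)
probs = logits.softmax(dim=-1)             # zero-shot match probabilities per prompt
```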