With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modalityspecific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between both the image and question modality, the model is required to learn the joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10m and 20m spatial resolution. Experimental results show the importance of utilizing these bands to characterize the land-use land-cover classes present in the images in the framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multimodal-fusion-transformer-for-vqa-in-rs.
<p>As a result of advancements in satellite technology, archives of remote sensing (RS) images are continuously growing, providing a valuable source of information for monitoring the Earth's surface. Researchers construct well-designed and ready-to-use datasets from the plethora of RS images for the broader community to make it easier to develop and compare novel algorithms, models, and architectures to further deepen our understanding of our planet from space. However, the descriptions of these datasets are often published in scientific papers as PDF files with several limitations:</p><ul><li>The target audience is typically domain experts familiar with scientific jargon;</li> <li>The work is required to adhere to a specific page limit;</li> <li>Once the document is published, it is difficult to update sections or to centralize discussions around it.&#160;</li> </ul><p>To overcome these issues, here we introduce the concept of interactive dataset websites that aim at making the dataset and research based on it more accessible. With visual and interactive examples, users can see exactly how the data is structured and how the data can be used in different contexts. For example, when working with RS data, it is beneficial to get a quick overview of the geographical distribution. By providing more in-depth background information about data sources and product specifications, these websites can also help users understand the context in which the data was collected, how it might be relevant to their work, and how to avoid common pitfalls. Another important aspect of interactive dataset websites is the inclusion of example code for using, loading, and visualizing the data. Especially when working with RS images (e.g., multispectral, hyperspectral, synthetic aperture radar data, etc), it is often not trivial to visualize the data. Providing example code can be especially useful for researchers unfamiliar with the specific tools required to work with the data or to introduce to the community tools specifically written to make it easier to work with the dataset. Quick feedback can be vital, as it allows researchers to report problems or ask questions that the authors or community can address in an open and centralized manner. Creating these "living, ever-evolving documents" makes them an increasingly valuable resource for anyone working with the dataset, leading to more robust and reliable research.</p><p>It might seem daunting at first to create such an interactive dataset website, but due to recent open-source projects such as Executable Books (https://executablebooks.org/) and free hosting providers such as GitHub Pages (https://pages.github.com/), it has become relatively easy to produce and host such websites. The HTML content can be generated from Jupyter Notebooks, a tool that many researchers and data scientists are familiar with. To provide an example, in our talk we will showcase an interactive dataset website for the BigEarthNet-MM dataset, which you can find here: https://docs.kai-tub.tech/ben-docs/</p>
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.