We address the problem of cross-modal information retrieval in the domain of remote sensing. In particular, we are interested in two application scenarios: i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery, and ii) multi-label image retrieval between very high resolution (VHR) images and speech-based label annotations. These multi-modal retrieval scenarios are more challenging than traditional uni-modal retrieval given the inherent differences in the distributions of the modalities. However, with the increasing availability of multi-source remote sensing data and the scarcity of semantic annotations, the task of multi-modal retrieval has recently become extremely important. In this regard, we propose a novel deep neural network based architecture that learns a discriminative shared feature space for all the input modalities, suitable for semantically coherent information retrieval. Extensive experiments are carried out on the benchmark large-scale PAN/multi-spectral DSRSID dataset and the multi-label UC-Merced dataset. In addition, for the UC-Merced dataset we generate a corpus of speech signals corresponding to the labels. Superior performance with respect to the current state of the art is observed in all the cases.
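To make the idea of retrieval in a shared feature space concrete, the following is a minimal, hypothetical sketch (written in PyTorch): two modality-specific encoders project PAN and multi-spectral inputs into a common embedding space, and retrieval is performed by nearest-neighbour search under cosine similarity. All layer sizes, names, and the generic two-branch design are illustrative assumptions, not the architecture proposed in this work.

# Hypothetical sketch of cross-modal retrieval in a shared embedding space.
# Layer sizes and names are illustrative assumptions, not the proposed model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps images of one modality (e.g. PAN: 1 band, MS: 4 bands) to a d-dim embedding."""
    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so that cosine similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)

pan_encoder = ModalityEncoder(in_channels=1)   # panchromatic branch
ms_encoder = ModalityEncoder(in_channels=4)    # multi-spectral branch

# Toy data: one PAN query and a gallery of 10 multi-spectral images.
pan_query = torch.randn(1, 1, 64, 64)
ms_gallery = torch.randn(10, 4, 64, 64)

with torch.no_grad():
    q = pan_encoder(pan_query)          # shape (1, 128)
    g = ms_encoder(ms_gallery)          # shape (10, 128)
    scores = q @ g.t()                  # cosine similarities in the shared space
    ranked = scores.argsort(dim=-1, descending=True)
print("gallery indices ranked by similarity:", ranked[0].tolist())

In practice such encoders would be trained jointly (for instance with a similarity or classification objective over paired samples) so that semantically matching PAN and multi-spectral images land close together in the shared space.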
Introduction

With the availability of extensive data collections and efficient storage facilities, recent years have witnessed the accumulation of an enormous volume of remote sensing data. These data represent different modalities of the same underlying phenomenon. This has, in turn, led to the requirement of shifting the paradigm from traditional uni-modal to the more challenging cross-modal information extraction scenario. Ideally, given a query from one domain, cross-modal approaches retrieve relevant information from the other domains. In this regard, the necessity of multi-modal approaches is especially evident in the area of remote sensing (RS) data analysis, due to the availability of a large number of satellite missions. As an example, given a panchromatic query image, we may want to retrieve all semantically consistent multi-spectral images captured by a different satellite sensor for better analysis in the spectral domain. In short, the paradigm of cross-modal information retrieval is critical in RS given the complementary information captured by different cross-sensor acquisitions.

Although the notion of image retrieval has received extensive attention, several other cross-modal combinations have also been explored in the past, specifically in the areas of computer vision and natural language processing: image-text (label) pairs [1,2], RGB image-depth image pairs [3], etc. While most of the recent endeavors in this respect follow a closed-form formulation to directly associate the samples from different domains [1,4], others [5,2,6] depend on conventional learning-based approaches. In contrast to single-label based image retrieval, few strategies extend the framework to support multiple...