CMIR-NET : A deep learning based model for cross-modal retrieval in remote sensing

Chaudhuri, Ushasi; Banerjee, Biplab; Bhattacharya, Avik; Datcu, Mihai

doi:10.1016/j.patrec.2020.02.006

Cited by 57 publications

(44 citation statements)

References 18 publications

(33 reference statements)

“…By definition, the task of image fusion aims at synergistically combining images from different related modalities to generate a merged representation of the information present in the images, improving visual inference performance over the individual images. Growing interest from the multimedia community is reflected in various works like [21] where audio-visual crossmodal representation learning was proposed, in [22] where RGB-depth multimodal features were fused for scene classification and in shared cross modal image retrieval [23]. It is also an emerging topic in medical image classification.…”

Section: Related Work (mentioning)

confidence: 99%

FusAtNet: Dual Attention based SpectroSpatial Multimodal Fusion Network for Hyperspectral and LiDAR Classification

Mohla

Pande

Banerjee

et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Self Cite

134

With recent advances in sensing, multimodal data is becoming easily available for various applications, especially in remote sensing (RS), where many data types like multispectral imagery (MSI), hyperspectral imagery (HSI), LiDAR etc. are available. Effective fusion of these multisource datasets is becoming important, for these multimodality features have been shown to generate highly accurate land-cover maps. However, fusion in the context of RS is non-trivial considering the redundancy involved in the data and the large domain differences among multiple modalities. In addition, the feature extraction modules for different modalities hardly interact among themselves, which further limits their semantic relatedness. As a remedy, we propose a feature fusion and extraction framework, namely FusAtNet, for collective land-cover classification of HSIs and LiDAR data in this paper. The proposed framework effectively utilizes the HSI modality to generate an attention map using a "self-attention" mechanism that highlights its own spectral features. Similarly, a "cross-attention" approach is simultaneously used to harness the LiDAR derived attention map that accentuates the spatial features of HSI. These attentive spectral and spatial representations are then explored further along with the original data to obtain modality-specific feature embeddings. The modality oriented joint spectro-spatial information thus obtained is subsequently utilized to carry out the land-cover classification task. Experimental evaluations on three HSI-LiDAR datasets show that the proposed method achieves the state-of-the-art classification performance, including on the largest HSI-LiDAR dataset available, University of Houston (Data Fusion Contest-2013), opening new avenues in multimodal feature fusion for classification.

Figure 1. Generic schematic of a multimodal fusion based classification task. The objective is to effectively combine the two modalities (here, HSI and LiDAR) such that the resultant representation has rich, fused features that are relevant and robust enough for accurate classification.

“…In the same way, multi-source data can be fused using the CMIR-NET (e.g.) that learns from two separate, but labelled data sets [62]. Compared to other techniques, a high performance can only be achieved with a very large amount of data.…”

Section: Multi-SAR System With Image Fusion (mentioning)

confidence: 99%

Multi-Source and Multi-Temporal Image Fusion on Hypercomplex Bases

et al. 2020

This article spanned a new, consistent framework for production, archiving, and provision of analysis ready data (ARD) from multi-source and multi-temporal satellite acquisitions and a subsequent image fusion. The core of the image fusion was an orthogonal transform of the reflectance channels from optical sensors on hypercomplex bases delivered in Kennaugh-like elements, which are well-known from polarimetric radar. In this way, SAR and Optics could be fused to one image data set sharing the characteristics of both: the sharpness of Optics and the texture of SAR. The special properties of Kennaugh elements regarding their scaling (linear, logarithmic, normalized) applied likewise to the new elements and guaranteed their robustness towards noise, radiometric sub-sampling, and therewith data compression. This study combined Sentinel-1 and Sentinel-2 on an Octonion basis as well as Sentinel-2 and ALOS-PALSAR-2 on a Sedenion basis. The validation using signatures of typical land cover classes showed that the efficient archiving in 4 bit images still guaranteed an accuracy over 90% in the class assignment. Due to the stability of the resulting class signatures, the fuzziness to be caught by Machine Learning Algorithms was minimized at the same time. Thus, this methodology was predestined to act as new standard for ARD remote sensing data with a subsequent image fusion processed in so-called data cubes.

“…Feature extraction with deep learning-based methods is found in several applications with remote sensing imagery [10][11][12][13][14][15][16][17][18]. These deep networks are built with different types of architectures that follow a hierarchical type of learning.…”

Section: Introduction (mentioning)

confidence: 99%

ATSS Deep Learning-Based Approach to Detect Apple Fruits

et al. 2020

In recent years, many agriculture-related problems have been evaluated with the integration of artificial intelligence techniques and remote sensing systems. Specifically, in fruit detection problems, several recent works were developed using Deep Learning (DL) methods applied in images acquired in different acquisition levels. However, the increasing use of anti-hail plastic net cover in commercial orchards highlights the importance of terrestrial remote sensing systems. Apples are one of the most highly-challenging fruits to be detected in images, mainly because of the target occlusion problem occurrence. Additionally, the introduction of high-density apple tree orchards makes the identification of single fruits a real challenge. To support farmers to detect apple fruits efficiently, this paper presents an approach based on the Adaptive Training Sample Selection (ATSS) deep learning method applied to close-range and low-cost terrestrial RGB images. The correct identification supports apple production forecasting and gives local producers a better idea of forthcoming management practices. The main advantage of the ATSS method is that only the center point of the objects is labeled, which is much more practicable and realistic than bounding-box annotations in heavily dense fruit orchards. Additionally, we evaluated other object detection methods such as RetinaNet, Libra Regions with Convolutional Neural Network (R-CNN), Cascade R-CNN, Faster R-CNN, Feature Selective Anchor-Free (FSAF), and High-Resolution Network (HRNet). The study area is a highly-dense apple orchard consisting of Fuji Suprema apple fruits (Malus domestica Borkh) located in a smallholder farm in the state of Santa Catarina (southern Brazil). A total of 398 terrestrial images were taken nearly perpendicularly in front of the trees by a professional camera, assuring both a good vertical coverage of the apple trees in terms of heights and overlapping between picture frames. 
Afterwards, the high-resolution RGB images were divided into several patches to help the detection of small and/or occluded apples. A total of 3119, 840, and 2010 patches were used for training, validation, and testing, respectively. Moreover, the proposed method’s generalization capability was assessed by applying simulated image corruptions to the test set images with different severity levels, including noise, blurs, weather, and digital processing. Experiments were also conducted by varying the bounding box size (80, 100, 120, 140, 160, and 180 pixels) in the original image for the proposed approach. Our results showed that the ATSS-based method slightly outperformed all other deep learning methods, by between 0.3% and 2.4%. Also, we verified that the best result was obtained with a bounding box size of 160 × 160 pixels. The proposed method was robust to most of the corruptions, except for snow, frost, and fog weather conditions. Finally, a benchmark of the reported dataset is also generated and publicly available.
