Traditional urban scene-classification approaches focus on images taken either by satellite or in aerial view. Although single-view images are able to achieve satisfactory results for scene classification in most situations, the complementary information provided by other image views is needed to further improve performance. Therefore, we present a complementary information-learning model (CILM) to perform multi-view scene classification of aerial and ground-level images. Specifically, the proposed CILM takes aerial and ground-level image pairs as input to learn view-specific features for later fusion to integrate the complementary information. To train CILM, a unified loss consisting of cross entropy and contrastive losses is exploited to force the network to be more robust. Once CILM is trained, the features of each view are extracted via the two proposed feature-extraction scenarios and then fused to train the support vector machine classifier for classification. The experimental results on two publicly available benchmark data sets demonstrate that CILM achieves remarkable performance, indicating that it is an effective model for learning complementary information and thus improving urban scene classification.
Abstract. Scene classification plays an important role in remote sensing field. Traditional approaches use high-resolution remote sensing images as data source to extract powerful features. Although these kind of methods are common, the model performance is severely affected by the image quality of the dataset, and the single modal (source) of images tend to cause the mission of some scene semantic information, which eventually degrade the classification accuracy. Nowadays, multi-modal remote sensing data become easy to obtain since the development of remote sensing technology. How to carry out scene classification of cross-modal data has become an interesting topic in the field. To solve the above problems, this paper proposes using feature fusion for cross-modal scene classification of remote sensing image, i.e., aerial and ground street view images, expecting to use the advantages of aerial images and ground street view data to complement each other. Our cross- modal model is based on Siamese Network. Specifically, we first train the cross-modal model by pairing different sources of data with aerial image and ground data. Then, the trained model is used to extract the deep features of the aerial and ground image pair, and the features of the two perspectives are fused to train a SVM classifier for scene classification. Our approach has been demonstrated using two public benchmark datasets, AiRound and CV-BrCT. The preliminary results show that the proposed method achieves state-of-the-art performance compared with the traditional methods, indicating that the information from ground data can contribute to aerial image classification.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.