Remote sensing products with high temporal and spatial resolution are difficult to obtain under the constraints of existing technology and cost. The spatiotemporal fusion of remote sensing images has therefore attracted considerable attention. Spatiotemporal fusion algorithms based on deep learning have gradually developed, but they face several problems: the amount of available data limits the model's ability to learn, model robustness is low, features extracted through convolution alone are insufficient, and complex fusion methods introduce noise. To solve these problems, we propose MSNet, a multi-stream fusion network for remote sensing spatiotemporal fusion based on the Transformer and convolution. We introduce the Transformer structure to learn the global temporal correlation of the image, and we use a convolutional neural network to establish the relationship between input and output and to extract features. Finally, we adopt an average-weighting fusion method to avoid the noise introduced by more complicated fusion schemes. To test the robustness of MSNet, we conducted experiments on three datasets and compared the results with four representative spatiotemporal fusion algorithms, demonstrating the superiority of MSNet (spectral angle mapper (SAM) < 0.193 on the CIA dataset, erreur relative globale adimensionnelle de synthèse (ERGAS) < 1.687 on the LGC dataset, and root mean square error (RMSE) < 0.001 on the AHB dataset).
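For reference, the spectral angle mapper (SAM) quoted above measures the mean angle between corresponding spectral vectors of the predicted and reference images; a minimal NumPy sketch (the function name and array layout are our own conventions, not from the paper):

```python
import numpy as np

def spectral_angle_mapper(pred, ref):
    """Mean spectral angle in radians between two images of shape (bands, H, W)."""
    p = pred.reshape(pred.shape[0], -1).T  # (pixels, bands)
    r = ref.reshape(ref.shape[0], -1).T
    dot = np.sum(p * r, axis=1)
    norm = np.linalg.norm(p, axis=1) * np.linalg.norm(r, axis=1)
    # Clip to guard against floating-point values just outside [-1, 1].
    cos = np.clip(dot / np.maximum(norm, 1e-12), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))
```

Because SAM compares only the direction of each spectral vector, it is insensitive to uniform scaling of the spectra, which is why it is often paired with magnitude-sensitive metrics such as RMSE.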
Popular deep-learning-based spatiotemporal fusion methods for creating images with high temporal and high spatial resolution have certain limitations: the reconstructed images retain insufficient high-frequency information, and the models are not robust owing to the lack of training data. We propose a dual-branch remote sensing spatiotemporal fusion network based on a selection kernel mechanism. The network comprises a super-resolution module, a high-frequency feature-extraction module, and a difference-reconstruction module. A convolution-kernel adaptive mechanism is added to the high-frequency feature-extraction and difference-reconstruction modules to improve robustness. The super-resolution module upgrades the coarse image to a transition image that matches the fine image; the high-frequency feature-extraction module extracts high-frequency features of the fine image to supplement the difference-reconstruction module; and the difference-reconstruction module uses structural similarity to reconstruct the fine-difference image. The fusion result is obtained by combining the reconstructed fine-difference image with the known fine image. A compound loss function is used to aid network training. Experiments on three datasets, with five representative spatiotemporal fusion algorithms used for comparison, validate the superiority of the proposed method in both subjective and objective evaluations.
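The final fusion step described above adds the reconstructed fine-difference image to the known fine image. The abstract does not give the exact form of the compound loss; a common choice in this setting combines an ℓ1 term with a structural-similarity term, so the sketch below should be read under that assumption (the single-window SSIM, the weight `alpha`, and all names are ours):

```python
import numpy as np

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """SSIM computed over the whole image in one window (a simplification;
    SSIM is usually computed over local windows and averaged)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2))

def compound_loss(pred_diff, true_diff, alpha=0.8):
    """Assumed compound form: alpha * L1 + (1 - alpha) * (1 - SSIM)."""
    l1 = np.abs(pred_diff - true_diff).mean()
    return alpha * l1 + (1 - alpha) * (1 - global_ssim(pred_diff, true_diff))

def fuse(fine_t1, reconstructed_diff):
    """Predicted fine image = known fine image + reconstructed fine-difference image."""
    return fine_t1 + reconstructed_diff
```

Predicting the difference image rather than the fine image directly lets the network focus on temporal change while the known fine image supplies the static spatial detail.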
Remote sensing images with high temporal and spatial resolution play a crucial role in land-surface change monitoring, vegetation monitoring, and natural disaster mapping. However, existing technical conditions and cost constraints make such images very difficult to obtain directly, so spatiotemporal fusion technology for remote sensing images has attracted considerable attention, and deep learning-based fusion methods have been developed in recent years. In this study, to improve the accuracy and robustness of deep learning models and better extract the spatiotemporal information of remote sensing images, the existing multi-stream remote sensing spatiotemporal fusion network MSNet is enhanced with dilated convolution and an improved transformer encoder; we call the resulting network EMSNet. Dilated convolution is used to extract temporal information while reducing the number of parameters. The transformer encoder is modified to better suit image-fusion tasks and to extract spatiotemporal information effectively. A new weighting strategy is used for fusion, which substantially improves the model's prediction accuracy, image quality, and fusion effect. The superiority of the proposed approach is confirmed by comparison with six representative spatiotemporal fusion algorithms on three disparate datasets. Compared with MSNet, EMSNet improves SSIM by 15.3% on the CIA dataset, ERGAS by 92.1% on the LGC dataset, and RMSE by 92.9% on the AHB dataset.
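For reference, the ERGAS and RMSE figures reported across these abstracts can be computed as follows; a minimal NumPy sketch (conventions for the resolution ratio in ERGAS vary between papers, so treat the `ratio` argument as an assumption rather than the papers' exact definition):

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def ergas(pred, ref, ratio):
    """ERGAS (lower is better). pred, ref: (bands, H, W).
    ratio: pixel-size ratio of the fine image to the coarse image
    (e.g. 1/16 for a 30 m fine image fused with a 480 m coarse image)."""
    terms = [(rmse(pred[k], ref[k]) / ref[k].mean()) ** 2
             for k in range(ref.shape[0])]
    return 100.0 * ratio * float(np.sqrt(np.mean(terms)))
```

Because ERGAS normalizes each band's RMSE by that band's mean, it is comparable across sensors with different radiometric ranges, which is why it is a standard metric on the CIA and LGC benchmarks.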