Multimodal remote sensing image fusion based on self-supervised pre-training and cross-scale contrastive learning
Author: LI Zhao-Wei, FENG Shi-Yang, WANG Bin

Affiliation: 1. Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China; 2. Image and Intelligence Laboratory, School of Information Science and Technology, Fudan University, Shanghai 200433, China

CLC Number: TP751

Fund Project: Supported by the National Natural Science Foundation of China (62371140) and the National Key Research and Development Program of China (2022YFB3903404)

Abstract:

Self-supervised pre-training methods have strong capabilities in feature extraction and model transfer. However, current pre-training methods for multimodal remote sensing image (RSI) fusion only apply simple fusion operations, such as concatenation, to the extracted multimodal features, without dedicated modules for integrating multimodal information, which leads to insufficient fusion of complementary information across modalities. In addition, these methods neither consider nor exploit the cross-scale consistency priors within RSIs, which limits the extraction and integration of multimodal remote sensing information and leaves room for improvement on various downstream tasks. To address these issues, a multimodal RSI fusion method based on self-supervised pre-training and cross-scale contrastive learning is proposed, consisting of three main parts: 1) a cross-attention fusion mechanism preliminarily integrates the features extracted from the different modalities, and encoder modules then extract deeper features, explicitly aggregating the complementary information of each modality; 2) a cross-modality fusion mechanism allows each modality to extract useful supplementary information from the features of all modalities and, after separate decoding, to reconstruct its own input; 3) based on the cross-scale consistency constraints of RSIs, cross-scale contrastive learning is introduced to enhance single-modality feature extraction, yielding more robust pre-training. Experimental results on multiple public multimodal RSI fusion datasets demonstrate that the proposed method achieves significant performance improvements over existing methods on various downstream tasks. On the Globe230k dataset, it reaches a mean intersection over union (mIoU) of 79.01%, an overall accuracy (OA) of 92.56%, and a mean F1 score (mF1) of 88.05%, while offering good scalability and easy hyperparameter setting.
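The abstract names two concrete mechanisms: cross-attention fusion across modalities (part 1) and a cross-scale contrastive objective (part 3). Since the paper's implementation is not reproduced here, the following is only a minimal PyTorch sketch of how such components are commonly built; every module name, shape, and hyperparameter (e.g., CrossAttentionFusion, the token dimension of 256, the temperature of 0.07) is an illustrative assumption, not taken from the paper.

```python
# Illustrative sketch only: common implementations of cross-attention fusion
# and an InfoNCE-style cross-scale contrastive loss. All names, shapes, and
# hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Fuse token features of one modality with those of another.

    Queries come from modality A; keys/values from modality B, so A can
    aggregate complementary information from B. Assumes ViT-style token
    sequences of shape [batch, tokens, dim].
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(feat_a)
        kv = self.norm_kv(feat_b)
        fused, _ = self.attn(q, kv, kv)  # modality A attends to modality B
        return feat_a + fused            # residual keeps A's own content


def cross_scale_contrastive_loss(z_fine: torch.Tensor,
                                 z_coarse: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls together embeddings of the same scene
    at two spatial scales and pushes apart embeddings of different scenes.

    z_fine / z_coarse: [batch, dim] pooled embeddings of the same images at
    two scales (this positive-pairing scheme is an assumption).
    """
    z_fine = F.normalize(z_fine, dim=-1)
    z_coarse = F.normalize(z_coarse, dim=-1)
    logits = z_fine @ z_coarse.t() / temperature  # [batch, batch] similarities
    targets = torch.arange(z_fine.size(0), device=z_fine.device)
    # symmetric loss: fine -> coarse and coarse -> fine directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    b, n, d = 4, 196, 256
    sar, opt = torch.randn(b, n, d), torch.randn(b, n, d)
    fusion = CrossAttentionFusion(dim=d)
    fused = fusion(sar, opt)  # e.g., SAR tokens enriched with optical cues
    loss = cross_scale_contrastive_loss(fused.mean(1), torch.randn(b, d))
    print(fused.shape, loss.item())
```

Part 2) of the method additionally routes fused features back through per-modality decoders so each modality reconstructs its own input; that reconstruction path is omitted from the sketch for brevity.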

Citation:

LI Zhao-Wei, FENG Shi-Yang, WANG Bin. Multimodal remote sensing image fusion based on self-supervised pre-training and cross-scale contrastive learning[J]. Journal of Infrared and Millimeter Waves, 2025, 44(4): 520-533.
Related Videos

Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received: November 03, 2024
  • Revised: May 13, 2025
  • Accepted: December 12, 2024
  • Online: May 12, 2025