Multimodal remote sensing image fusion based on self-supervised pre-training and cross-scale contrastive learning
Author: LI Zhao-Wei, FENG Shi-Yang, WANG Bin

Affiliation: 1. Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China; 2. Image and Intelligence Laboratory, School of Information Science and Technology, Fudan University, Shanghai 200433, China

CLC Number: TP751

Fund Project: Supported by the National Natural Science Foundation of China (62371140) and the National Key Research and Development Program of China (2022YFB3903404)

Abstract:

Self-supervised pre-training methods have strong capabilities in feature extraction and model transfer. However, current pre-training methods for multimodal remote sensing image (RSI) fusion only apply simple fusion operations, such as concatenation, to the extracted multimodal features, without dedicated modules for integrating multimodal information, which leads to insufficient fusion of complementary information across modalities. In addition, these methods neither consider nor exploit the cross-scale consistency priors within RSIs, which limits the extraction and integration of multimodal remote sensing information and leaves room for improvement on various downstream tasks. To address these issues, a multimodal RSI fusion method based on self-supervised pre-training and cross-scale contrastive learning is proposed, consisting of three main parts: 1) a cross-attention fusion mechanism preliminarily integrates the features extracted from the different modalities, and encoder modules then extract deeper features, explicitly aggregating the complementary information of each modality; 2) a cross-modality fusion mechanism allows each modality to extract useful supplementary information from the features of all modalities and, after separate decoding, to reconstruct its own input; 3) based on the cross-scale consistency constraints of RSIs, cross-scale contrastive learning is introduced to enhance single-modality feature extraction, yielding more robust pre-training. Experimental results on multiple public multimodal RSI fusion datasets demonstrate that the proposed method achieves significant performance improvements over existing methods on various downstream tasks. On the Globe230k dataset, it reaches a mean intersection over union (mIoU) of 79.01%, an overall accuracy (OA) of 92.56%, and a mean F1 score (mF1) of 88.05%, while offering good scalability and easy hyperparameter setting.
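The abstract names two concrete mechanisms: cross-attention fusion across modalities (part 1) and a cross-scale contrastive objective (part 3). Since the paper's implementation is not reproduced here, the following is only a minimal PyTorch sketch of how such components are commonly built; every module name, shape, and hyperparameter (e.g., CrossAttentionFusion, the token dimension of 256, the temperature of 0.07) is an illustrative assumption, not taken from the paper.

```python
# Illustrative sketch only: common implementations of cross-attention fusion
# and an InfoNCE-style cross-scale contrastive loss. All names, shapes, and
# hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Fuse token features of one modality with those of another.

    Queries come from modality A; keys/values from modality B, so A can
    aggregate complementary information from B. Assumes ViT-style token
    sequences of shape [batch, tokens, dim].
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(feat_a)
        kv = self.norm_kv(feat_b)
        fused, _ = self.attn(q, kv, kv)  # modality A attends to modality B
        return feat_a + fused            # residual keeps A's own content


def cross_scale_contrastive_loss(z_fine: torch.Tensor,
                                 z_coarse: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls together embeddings of the same scene
    at two spatial scales and pushes apart embeddings of different scenes.

    z_fine / z_coarse: [batch, dim] pooled embeddings of the same images at
    two scales (this positive-pairing scheme is an assumption).
    """
    z_fine = F.normalize(z_fine, dim=-1)
    z_coarse = F.normalize(z_coarse, dim=-1)
    logits = z_fine @ z_coarse.t() / temperature  # [batch, batch] similarities
    targets = torch.arange(z_fine.size(0), device=z_fine.device)
    # symmetric loss: fine -> coarse and coarse -> fine directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    b, n, d = 4, 196, 256
    sar, opt = torch.randn(b, n, d), torch.randn(b, n, d)
    fusion = CrossAttentionFusion(dim=d)
    fused = fusion(sar, opt)  # e.g., SAR tokens enriched with optical cues
    loss = cross_scale_contrastive_loss(fused.mean(1), torch.randn(b, d))
    print(fused.shape, loss.item())
```

Part 2) of the method additionally routes fused features back through per-modality decoders so each modality reconstructs its own input; that reconstruction path is omitted from the sketch for brevity.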

Citation:

LI Zhao-Wei, FENG Shi-Yang, WANG Bin. Multimodal remote sensing image fusion based on self-supervised pre-training and cross-scale contrastive learning[J]. Journal of Infrared and Millimeter Waves, 2025, 44(4): 520-533.
Related Videos

Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received: November 03, 2024
  • Revised: May 13, 2025
  • Accepted: December 12, 2024
  • Online: May 12, 2025