Abstract: Self-supervised pre-training methods have strong capabilities in feature extraction and model transfer. However, existing pre-training methods for multimodal remote sensing image (RSI) fusion suffer from two limitations. First, they only perform simple fusion operations, such as concatenation, on the extracted multimodal features, without dedicated modules for integrating multimodal information, leading to insufficient fusion of complementary information across modalities. Second, they do not exploit the cross-scale consistency priors within RSIs, which limits the extraction and integration of multimodal remote sensing information and leaves room for improvement on downstream tasks. To address these issues, a multimodal RSI fusion method based on self-supervised pre-training and cross-scale contrastive learning is proposed, which consists of three main parts: 1) a cross-attention fusion mechanism preliminarily integrates the features extracted from different modalities, and encoder modules then extract further features, explicitly aggregating and extracting the complementary information of each modality; 2) a cross-modality fusion mechanism allows each modality to draw useful supplementary information from the features of all modalities, and each modality's input is then reconstructed after separate decoding; 3) based on the cross-scale consistency constraints of RSIs, cross-scale contrastive learning is introduced to enhance single-modality feature extraction and achieve more robust pre-training. Experimental results on multiple public multimodal RSI fusion datasets demonstrate that, compared with existing methods, the proposed method achieves significant performance improvements on various downstream tasks. On the Globe230k dataset, it reaches a mean intersection over union (mIoU) of 79.01%, an overall accuracy (OA) of 92.56%, and a mean F1 score (mF1) of 88.05%, while offering good scalability and simple hyperparameter setting.
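To make the two core mechanisms named in the abstract concrete, the following is a minimal PyTorch-style sketch of (a) cross-attention fusion between two modality feature sequences and (b) a cross-scale InfoNCE-style contrastive loss between full-scale and downsampled-scale embeddings of the same image. The module names, dimensions, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: assumed module names, dimensions, and loss form.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Fuse modality-A tokens with complementary information queried from modality B."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, tokens, dim); modality A attends to modality B,
        # with a residual path preserving A's own features.
        fused, _ = self.attn(query=feat_a, key=feat_b, value=feat_b)
        return self.norm(feat_a + fused)


def cross_scale_contrastive_loss(z_full: torch.Tensor,
                                 z_down: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss pairing each image's full-scale embedding with its
    downsampled-scale embedding; other images in the batch act as negatives."""
    z_full = F.normalize(z_full, dim=-1)          # (batch, dim)
    z_down = F.normalize(z_down, dim=-1)          # (batch, dim)
    logits = z_full @ z_down.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(z_full.size(0), device=z_full.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=256)
    opt_feats = torch.randn(4, 196, 256)   # e.g., optical-branch tokens (hypothetical shapes)
    sar_feats = torch.randn(4, 196, 256)   # e.g., SAR-branch tokens
    fused = fusion(opt_feats, sar_feats)   # (4, 196, 256)

    loss = cross_scale_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
    print(fused.shape, loss.item())
```

In the full method described above, such a fusion block would sit between the modality-specific feature extractors and the shared encoder, and the contrastive term would be added to the reconstruction objective during pre-training; the exact composition is described in the paper body, not in this sketch.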