Abstract: Current infrared dim small target detection algorithms explicitly align multi-frame features, which accumulates estimation errors, and network downsampling causes target features to be lost. To address both problems, a progressive spatio-temporal feature fusion network is proposed. The network uses a progressive temporal feature accumulation module to aggregate multi-frame information implicitly, and a multi-scale spatial feature fusion module to strengthen the interaction between shallow detail features and deep semantic features. Because multi-frame infrared dim target datasets are scarce, a highly realistic semi-synthetic dataset is also constructed. Compared with mainstream algorithms, the proposed algorithm improves the probability of detection by 4.69% on the proposed dataset and by 4.22% on a public dataset.
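
As a rough illustration of what implicit, progressive accumulation of multi-frame information could look like, the sketch below folds frames into a running state one at a time, with no explicit alignment step. It is not the proposed module: a fixed running average stands in for the learned accumulation, and the function name `accumulate_progressively`, the weight `alpha`, and the toy 2x2 frames are all assumptions made for this example.

```python
def accumulate_progressively(frames, alpha=0.5):
    """Fold frames into a running state, one frame at a time.

    Assumption: a fixed running average replaces the learned temporal
    accumulation module; no motion estimation or alignment is performed.
    """
    state = [row[:] for row in frames[0]]  # start from the first frame
    for frame in frames[1:]:
        state = [
            [alpha * s + (1.0 - alpha) * f for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(state, frame)
        ]
    return state


# Toy input (hypothetical): a persistent target response at (1, 1)
# and a sign-flipping noise response at (0, 0).
frames = [
    [[0.5, 0.0], [0.0, 1.0]],
    [[-0.5, 0.0], [0.0, 1.0]],
    [[0.5, 0.0], [0.0, 1.0]],
    [[-0.5, 0.0], [0.0, 1.0]],
]
fused = accumulate_progressively(frames)
print(fused[1][1], fused[0][0])
```

In this toy case the persistent response keeps its full strength while the fluctuating one shrinks in magnitude, which is the intuition behind aggregating several frames for dim targets.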