Abstract: Infrared small object detection has become a crucial research area in aerial surveillance, particularly as advances in autonomous drone technology raise the strategic importance of detecting aerial drones. Recent developments in computer vision have produced various deep learning-based detection algorithms that rely primarily on appearance features. However, existing methods struggle with targets that have weak texture and lack color information. To overcome these challenges, this paper introduces a novel multi-frame ensemble prediction framework for infrared small target detection of aerial drones. The proposed architecture consists of three key components: a ResNet-50 backbone network that extracts frame-level deep features from input sequences; a pixel decoder, enhanced with multi-scale deformable attention (operating on the C5/C4/C3 hierarchical features) and bicubic interpolation, that improves spatial resolution and enhances the features of small targets; and a dual-decoder structure, comprising a frame decoder (using learnable query vectors) and a target decoder (employing a Vision Transformer), that establishes spatial-temporal correlations and generates video-level query vectors capturing the spatiotemporal masks of target instances. During training, the Hungarian algorithm computes the optimal bipartite matching between predictions and ground truth annotations. Combined with joint optimization of classification loss, mask loss, and similarity loss, this enables end-to-end learning. For inference, the system performs video segmentation using adaptive mask fusion from high-confidence queries, along with frame differencing for static background suppression. Experiments on the DSAT dataset demonstrate the superiority of this approach, which achieves a precision of 0.6356 and an F-score of 0.6475, significantly outperforming existing methods in complex backgrounds.
These results highlight the effectiveness of the proposed framework in accurately detecting small infrared targets in challenging environments.
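
The bipartite matching step used during training can be illustrated with a minimal sketch. It assumes a precomputed cost matrix in which entry `cost[i][j]` is the cost of assigning prediction `i` to ground truth instance `j`. How the paper combines the classification, mask, and similarity terms into that cost is not specified here, so the values below are made up. The function name `min_cost_matching` is hypothetical, and the search is a brute-force stand-in that returns the same optimal assignment as the Hungarian algorithm on small inputs; it is not the Hungarian algorithm itself.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Find the minimum-cost one-to-one assignment by exhaustive search.

    cost[i][j] is the (assumed, illustrative) cost of matching prediction i
    to ground truth j. Every ground truth gets exactly one distinct
    prediction; surplus predictions stay unmatched.
    Returns (pairs, total) with pairs as a list of (pred_idx, gt_idx).
    """
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground truths")

    best_pairs, best_total = [], float("inf")
    # Try every ordered choice of n_gt distinct predictions.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(preds)]
    return best_pairs, best_total


# Three predictions (rows), two ground truth instances (columns).
# The numbers are invented for illustration only.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = min_cost_matching(cost)
print(sorted(pairs), round(total, 4))
```

Exhaustive search grows factorially with the number of queries, so it is only useful for building intuition. In practice the same assignment problem is usually solved in polynomial time, for example with `scipy.optimize.linear_sum_assignment`.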