Dual-Branch Two-Stage Detection for Small Objects in UAV Images
YANG Yi1,2, ZHU Jiangrui1,2, WANG Keping1,2, ZHANG Gaopeng3, QIAN Wei1,2, WANG Tian4
1. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003
2. Henan International Joint Laboratory of Direct Drive and Control of Intelligent Equipment, Henan Polytechnic University, Jiaozuo 454003
3. Xi'an Key Laboratory of Aircraft Optical Imaging and Measurement Technology, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119
4. School of Artificial Intelligence, Beihang University, Beijing 100083
Abstract: Extremely small target size is a major challenge for object detection in drone imagery. In particular, when drones fly at high altitude and the imaging resolution is low, the features of small targets tend to vanish in the deep layers of deep neural networks. To address this issue, a dual-branch two-stage (DB-TS) method for small object detection in unmanned aerial vehicle (UAV) images is proposed. DB-TS runs two parallel tasks: small object detection and super-resolution reconstruction. The small object detection branch is built on a Swin Transformer backbone, and the super-resolution reconstruction branch contains a spatial prior module (SPM) and a window attention module (WAM). Through super-resolution reconstruction, SPM recovers the spatial information carried by shallow features and WAM recovers the attention guidance carried by deep features. The two-stage framework consists of a training phase and an inference phase. In the training phase, high-resolution features serve as ground truth for the super-resolution reconstruction branch, which strengthens the ability of the small object detection branch to extract fine-grained details. In the inference phase, only the small object detection branch is retained, which significantly increases inference speed and reduces computational resource consumption. Experiments on the VisDrone and JZ-UAV datasets show that DB-TS achieves higher recognition accuracy than the baseline models and outperforms the compared state-of-the-art methods.
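The abstract describes a split between training and inference: an auxiliary super-resolution branch is supervised by high-resolution features during training and then discarded so that inference runs only the detection branch. The following is a minimal, framework-free sketch of that control flow, not the authors' implementation of SPM, WAM or the Swin Transformer backbone. It assumes that features are plain Python lists, that the super-resolution branch is supervised with an L1 loss, and that the two losses are summed with a weight `lam`; none of these choices come from the paper. The names `DualBranchDetector`, `train_loss`, `drop_sr_branch` and the `toy_*` stand-ins are hypothetical, and no gradient update is shown, only how the loss is composed and how the branch is removed.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Backbone = Callable[[Vector], Tuple[Vector, Vector]]  # image -> (shallow, deep)


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between reconstructed and high-resolution features."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


class DualBranchDetector:
    """Shared backbone + detection head; the SR branch exists only for training."""

    def __init__(
        self,
        backbone: Backbone,
        det_head: Callable[[Vector], float],
        sr_branch: Callable[[Vector, Vector], Vector],
    ) -> None:
        self.backbone = backbone
        self.det_head = det_head
        self.sr_branch: Optional[Callable[[Vector, Vector], Vector]] = sr_branch

    def train_loss(
        self,
        lr_image: Vector,
        hr_features: Vector,
        det_loss_fn: Callable[[float], float],
        lam: float = 1.0,
    ) -> float:
        """Training phase: detection loss + lam * SR loss (assumed loss form)."""
        if self.sr_branch is None:
            raise RuntimeError("SR branch was removed; model is in inference mode")
        shallow, deep = self.backbone(lr_image)
        det_loss = det_loss_fn(self.det_head(deep))
        # The SR branch reads both shallow (spatial) and deep features, loosely
        # mirroring the SPM/WAM split; here it is just an arbitrary callable.
        sr_loss = l1_loss(self.sr_branch(shallow, deep), hr_features)
        return det_loss + lam * sr_loss

    def drop_sr_branch(self) -> None:
        """Inference phase: discard the auxiliary branch entirely."""
        self.sr_branch = None

    def infer(self, image: Vector) -> float:
        """Inference uses only the backbone and the detection head."""
        _, deep = self.backbone(image)
        return self.det_head(deep)


# Toy stand-ins for the real networks (hypothetical, for illustration only).
def toy_backbone(img: Vector) -> Tuple[Vector, Vector]:
    return img, [2.0 * v for v in img]


def toy_det_head(deep: Vector) -> float:
    return sum(deep)


def toy_sr_branch(shallow: Vector, deep: Vector) -> Vector:
    return [s + d for s, d in zip(shallow, deep)]


def toy_det_loss(score: float) -> float:
    return abs(score - 6.0)


model = DualBranchDetector(toy_backbone, toy_det_head, toy_sr_branch)
loss = model.train_loss([1.0, 2.0], [4.0, 6.0], toy_det_loss, lam=0.5)
model.drop_sr_branch()
score = model.infer([1.0, 2.0])
print(loss, score)  # 0.25 6.0
```

Because the detection head never reads the super-resolution output, removing that branch leaves the detector's predictions unchanged and only removes the cost of computing the reconstruction, which is how retaining only the detection branch can reduce inference cost.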