Abstract Human pose estimation methods fall into two categories: coordinate regression-based methods and heatmap-based methods. Coordinate regression-based methods offer slightly faster inference but slightly lower accuracy, whereas heatmap-based methods localize keypoints precisely at the cost of higher computational and storage overhead. To combine the strengths of both, a human pose estimation method based on knowledge distillation and dynamic region refinement is proposed. First, knowledge from the heatmap model is transferred to the regression model through feature distillation and pose distillation. Then, in the coarse stage, features extracted by a multi-layer Transformer are used to generate an initial pose estimate, and the image features that need refinement are selected according to the scores of a quality predictor. Finally, in the refinement stage, fine-grained representations (refined features) are built in the regions related to certain keypoints, according to the correlation between keypoints and image regions, to refine the human pose. Experiments on the COCO and COCO-WholeBody datasets demonstrate that the proposed method locates keypoints accurately and achieves accurate human pose estimation.
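The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the MSE and L1 loss forms, the weights `alpha` and `beta`, the quality `threshold`, the `top_k` cutoff, and the summing of correlations over keypoints are all assumptions made for illustration, and plain Python lists stand in for feature tensors.

```python
from typing import List, Tuple


def distillation_loss(
    student_feat: List[float],
    teacher_feat: List[float],
    student_pose: List[Tuple[float, float]],
    teacher_pose: List[Tuple[float, float]],
    alpha: float = 1.0,  # assumed weight of the feature term
    beta: float = 1.0,   # assumed weight of the pose term
) -> float:
    """Weighted sum of a feature term (MSE) and a pose term (mean L1)."""
    # Feature distillation: the regression (student) features imitate
    # the heatmap (teacher) features.
    feat = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    feat /= len(student_feat)
    # Pose distillation: student coordinates imitate teacher keypoints.
    pose = sum(
        abs(sx - tx) + abs(sy - ty)
        for (sx, sy), (tx, ty) in zip(student_pose, teacher_pose)
    )
    pose /= len(student_pose)
    return alpha * feat + beta * pose


def select_regions(
    quality: float,
    correlation: List[List[float]],  # rows: keypoints, columns: regions
    threshold: float = 0.5,          # assumed quality cutoff
    top_k: int = 2,                  # assumed number of refined regions
) -> List[int]:
    """Return region indices to refine, or [] if the coarse pose suffices."""
    if quality >= threshold:
        return []
    num_regions = len(correlation[0])
    # Aggregate each region's relevance over all keypoints.
    relevance = [sum(row[r] for row in correlation) for r in range(num_regions)]
    ranked = sorted(range(num_regions), key=lambda r: relevance[r], reverse=True)
    return sorted(ranked[:top_k])


corr = [[0.1, 0.7, 0.2, 0.0], [0.0, 0.6, 0.3, 0.1]]
print(select_regions(0.3, corr))  # low quality score: refine regions
print(select_regions(0.9, corr))  # high quality score: skip refinement
```

In this sketch, a sample whose quality score reaches the threshold skips the refinement stage entirely, which is one plausible reading of how region selection keeps the added cost of refinement low.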
Fund: National Natural Science Foundation of China (No. 61603357)
Corresponding Author:
WEI Longsheng, Ph.D., associate professor. His research interests include computer vision and pattern recognition.
About the authors: FU Xingpeng, Master student. His research interests include deep learning and human pose estimation. LI Tangqiang, Master student. His research interests include deep learning and image segmentation. HUANG Haoyu, Master student. His research interests include deep learning and no-reference image quality assessment.
WEI Longsheng, FU Xingpeng, LI Tangqiang, et al. Human Pose Estimation Based on Knowledge Distillation and Dynamic Region Refinement[J]. Pattern Recognition and Artificial Intelligence, 2025, 38(2): 164-176.