Multi-person Human Pose Estimation Based on Deformable Convolution
ZHAO Yunxiao1,2,3, QIAN Yuhua1,3, WANG Keqi1,3
1. Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006; 2. Department of Computer Science and Technology, Lüliang University, Lüliang 033000; 3. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
Abstract: Deep neural networks for human pose estimation sample at fixed positions of the feature map, which makes it difficult to model the geometric transformations of human pose. As a result, the network generalizes poorly when the size, pose, or shooting angle of a human instance varies. To solve this problem, a multi-person human pose estimation method based on deformable convolution is proposed. Exploiting the strong ability of deformable convolution to model geometric transformations of targets, a feature extraction module is designed to maintain detection accuracy under geometric changes of human key points. To further improve the performance of the network, the loss is computed between the model's predictions and ground-truth values generated by a two-dimensional Gaussian model, and the model is trained iteratively. The proposed model detects human key points effectively under complex conditions such as changes in shooting angle, attachments, and person scale. Experiments show that the proposed model effectively improves the accuracy of human key point detection.
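The Gaussian ground truth and loss computation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_heatmap` and `heatmap_mse_loss`, the heatmap size, the value of `sigma`, and the choice of mean squared error as the loss are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Build a (height, width) target heatmap with a 2D Gaussian peak at (cx, cy).

    sigma is an assumed value; the abstract does not state one.
    """
    ys, xs = np.mgrid[0:height, 0:width]  # pixel row and column coordinates
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def heatmap_mse_loss(pred, target):
    """Mean squared error between predicted and target heatmaps (assumed loss)."""
    return float(np.mean((pred - target) ** 2))

# Hypothetical example: one key point at x=20, y=12 on a 64x48 heatmap.
target = gaussian_heatmap(64, 48, cx=20, cy=12)
pred = np.zeros_like(target)  # stand-in for a network output
loss = heatmap_mse_loss(pred, target)
```

In a multi-key-point setting, one such heatmap would typically be generated per key point and the loss averaged over all of them.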