Discriminatively Trained Action Recognition Model Based on Hierarchical Part Tree
QIAN Yinzhong1,2,3, SHEN Yifan1,2
1. Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433
2. School of Computer Science, Fudan University, Shanghai 200433
3. School of Software, Changzhou College of Information Technology, Changzhou 213164
Abstract: Action recognition based on body pose in static images is studied in this paper. A hierarchical part tree structure is proposed, in which each node is represented by a collection of poselets to capture its pose variations, and pairs of linked nodes are constrained to form a pictorial structure. Grounded on this structure, a discriminatively trained action recognition model based on the hierarchical part tree is presented. In addition to a deformation cost, the pairwise potential function of the model introduces a co-occurrence cost. Since each parent part contains its child parts and the relative position of linked nodes is modeled by a normal distribution, the matching procedure is inferred efficiently within the framework of distance transforms and message passing. Three models with different numbers of nodes, obtained by trimming the tree, are comparatively evaluated on two datasets. Experimental results demonstrate that the coarse parts in the first two layers are highly salient for action recognition, the body parts in the third layer further improve recognition performance, and the anatomical stick parts in the fourth layer contribute little to action recognition.
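The abstract describes tree-structured inference in which the pairwise potential combines a deformation cost (relative positions of linked nodes modeled as normal distributions) and a co-occurrence cost, solved by distance transforms and message passing. The following Python snippet is a minimal sketch of that idea, assuming a single poselet type per node, shared deformation parameters for all edges, and a brute-force message computation in place of the generalized distance transform; the names (Node, pass_message, score_tree) are illustrative and are not the authors' implementation.

```python
import numpy as np

class Node:
    """One part in the hierarchical tree, with an appearance score map."""
    def __init__(self, name, unary, children=None):
        self.name = name              # part name
        self.unary = unary            # HxW appearance (unary) score map
        self.children = children or []

def gaussian_deformation(h, w, mean, var):
    """Quadratic (log-Gaussian) penalty for placing the child at each
    location, centered on the expected offset from the parent."""
    ys, xs = np.mgrid[0:h, 0:w]
    return ((ys - mean[0]) ** 2 + (xs - mean[1]) ** 2) / (2.0 * var)

def pass_message(child_score, offset, var, cooccurrence):
    """Max-product message from child to parent.
    Brute-force stand-in for the generalized distance transform:
    for each parent location, take the best child location minus the
    deformation penalty, plus a constant co-occurrence bonus."""
    h, w = child_score.shape
    msg = np.full((h, w), -np.inf)
    for py in range(h):
        for px in range(w):
            penalty = gaussian_deformation(h, w,
                                           (py + offset[0], px + offset[1]),
                                           var)
            msg[py, px] = np.max(child_score - penalty) + cooccurrence
    return msg

def score_tree(node, offset=(0.0, 0.0), var=4.0, cooccurrence=0.5):
    """Bottom-up message passing: a node's score map is its unary map
    plus the messages collected from all of its children."""
    total = node.unary.copy()
    for child in node.children:
        total += pass_message(score_tree(child, offset, var, cooccurrence),
                              offset, var, cooccurrence)
    return total

# Toy usage: a two-layer tree (whole body -> upper / lower body).
h, w = 8, 8
rng = np.random.default_rng(0)
root = Node("body", rng.random((h, w)),
            [Node("upper_body", rng.random((h, w))),
             Node("lower_body", rng.random((h, w)))])
root_score = score_tree(root)
print("best root location:", np.unravel_index(np.argmax(root_score),
                                              root_score.shape))
```

In the full model each edge would carry its own learned mean offset, variance and co-occurrence weights over poselet mixtures, and the inner maximization would be computed in linear time with the generalized distance transform rather than the quadratic loop shown here.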