Temporal Group Deep Network Action Recognition Algorithm Based on Attention Mechanism
HU Zhengping1,2, DIAO Pengcheng1, ZHANG Ruixue1, LI Shufang1, ZHAO Mengyao1
1. School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China; 2. Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004, China
Abstract: Inspired by the mechanism of human visual perception, a temporal group deep network for action recognition based on an attention mechanism is proposed within the deep learning framework. To address the inability of local temporal information to describe complex, long-duration actions, a grouped sparse sampling strategy is employed to perform video-level temporal modeling at low computational cost. In the recognition stage, channel attention mapping is introduced to further exploit global feature information and capture discriminative points of interest, and channel-wise feature recalibration is performed to improve the representational ability of the network. Experimental results on the UCF101 and HMDB51 datasets show that the proposed algorithm achieves high recognition accuracy.
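The two ingredients named in the abstract can be illustrated with a minimal sketch. The first function shows grouped sparse sampling in the spirit described: the frames of a video are divided into equal temporal groups and one frame is drawn from each group, so a fixed, small number of frames covers the whole video. The second shows a squeeze-and-excitation-style channel recalibration, in which each channel is squeezed by global average pooling, passed through two toy fully connected layers, and rescaled by a sigmoid gate. All function names, the pure-Python representation of feature maps, and the tiny weight matrices are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def sample_segment_frames(num_frames, num_segments, deterministic=True):
    """Grouped sparse sampling (illustrative): split num_frames into
    num_segments equal groups and pick one frame index per group."""
    indices = []
    seg_len = num_frames / num_segments
    for k in range(num_segments):
        start = int(math.floor(k * seg_len))
        end = max(start, int(math.floor((k + 1) * seg_len)) - 1)
        # center frame at test time, a random frame within the group in training
        idx = (start + end) // 2 if deterministic else random.randint(start, end)
        indices.append(idx)
    return indices

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_recalibrate(feature_maps, w1, w2):
    """SE-style channel attention (illustrative): squeeze each channel by
    global average pooling, excite through two fully connected layers
    (ReLU then sigmoid), and rescale the channels by the resulting gates."""
    # feature_maps: list of channels, each a flat list of activations
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]          # squeeze
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))  # FC + ReLU
              for row in w1]
    gates = [_sigmoid(sum(w * h for w, h in zip(row, hidden)))     # FC + sigmoid
             for row in w2]
    return [[g * a for a in ch] for g, ch in zip(gates, feature_maps)]
```

For example, sampling 4 groups from a 100-frame video yields one representative index per quarter of the video, and with zero excitation weights every channel gate defaults to sigmoid(0) = 0.5, halving all activations.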