1. School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083; 2. Shunde Graduate School, University of Science and Technology Beijing, Foshan 528399; 3. Beijing Engineering Research Center of Industrial Spectrum Imaging, University of Science and Technology Beijing, Beijing 100083; 4. School of Advanced Engineering, University of Science and Technology Beijing, Beijing 100083
Abstract: To address spatio-temporal modeling in video action recognition, a temporally enhanced action recognition algorithm based on fused spatio-temporal features is proposed within a deep learning framework. To lower the cost of video-level temporal modeling, a sparse sampling strategy is employed that adapts to variations in video duration. In the recognition stage, the temporal difference between adjacent feature maps is computed to enhance motion information at the feature level. A residual structure is combined with the temporal enhancement structure to further improve the representation ability of the network. Experimental results show that the proposed algorithm obtains higher accuracy on the UCF101 and HMDB51 datasets and, with a smaller network scale, achieves better results in a real industrial operation recognition scenario.
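The two mechanisms named in the abstract, sparse sampling and feature-level temporal difference with a residual connection, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one centre frame is taken per equal-length segment, that each feature map is flattened to a list of floats, and that a scalar weight `alpha` stands in for whatever learned layers the network applies to the difference. The names `sparse_sample_indices`, `temporal_enhance`, and `alpha` are hypothetical.

```python
from typing import List, Sequence


def sparse_sample_indices(num_frames: int, num_segments: int) -> List[int]:
    """Pick the centre frame of each of `num_segments` equal spans.

    The number of sampled frames is fixed, so the cost does not grow
    with video duration (assumed sampling rule, not taken from the paper).
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    seg_len = num_frames / num_segments
    # min() keeps the index inside the clip when the video is very short.
    return [min(int(seg_len * (k + 0.5)), num_frames - 1)
            for k in range(num_segments)]


def temporal_enhance(features: Sequence[Sequence[float]],
                     alpha: float = 0.5) -> List[List[float]]:
    """Add a weighted adjacent-frame difference back onto each frame.

    `features[t]` is the flattened feature map of sampled frame t.
    `alpha` is a hypothetical stand-in for learned transformation layers.
    The last frame has no successor, so its difference is zero.
    """
    enhanced = []
    last = len(features) - 1
    for t, current in enumerate(features):
        nxt = features[t + 1] if t < last else current
        # Residual-style combination: original features plus motion cue.
        enhanced.append([c + alpha * (n - c) for c, n in zip(current, nxt)])
    return enhanced


if __name__ == "__main__":
    print(sparse_sample_indices(30, 3))
    print(temporal_enhance([[0.0, 1.0], [2.0, 1.0], [2.0, 5.0]]))
```

In this sketch a 30-frame clip and a 300-frame clip both yield the same number of sampled frames, which is how a fixed segment count keeps the modeling cost independent of duration.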