Video-Based Temporal Enhanced Action Recognition

doi:10.16451/j.cnki.issn1003-6059.202010010

Abstract: To address spatio-temporal modeling in video action recognition, a temporally enhanced action recognition algorithm based on fused spatio-temporal features is proposed within a deep learning framework. To lower the cost of video-level temporal modeling, a sparse sampling strategy is employed, which also adapts to varying video durations. In the recognition stage, the temporal difference between adjacent feature maps is computed to enhance motion information at the feature level. A residual structure is combined with the temporal enhanced structure to further improve the representational ability of the network. Experimental results show that the proposed algorithm achieves higher accuracy on the UCF101 and HMDB51 datasets and, with a smaller network, performs better in a real industrial operation recognition scenario.
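The two mechanisms named in the abstract, sparse sampling and the temporal difference between adjacent feature maps, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the centre frame per segment, the flattened-list representation of a feature map, and the zero-padded residual addition in `enhance_with_motion` are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def sparse_sample_indices(num_frames: int, num_segments: int,
                          rng: Optional[random.Random] = None) -> List[int]:
    """Pick one frame index from each of num_segments equal spans of the video.

    The number of sampled frames is fixed regardless of video length.
    With rng=None the centre frame of each span is used (deterministic);
    otherwise a random frame inside each span is drawn.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        # Guarantee a non-empty span even when the video is very short.
        end = max((k + 1) * num_frames // num_segments, start + 1)
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(rng.randrange(start, end))
    return indices


def temporal_difference(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return features[t + 1] - features[t] for each adjacent pair.

    features[t] is the (flattened) feature map of the t-th sampled frame.
    """
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(features, features[1:])]


def enhance_with_motion(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Add each difference map back onto its feature map (assumed residual form).

    The last frame has no successor, so its difference is taken as zero.
    """
    if not features:
        return []
    diffs = temporal_difference(features)
    diffs.append([0.0] * len(features[-1]))
    return [[f + d for f, d in zip(frame, diff)]
            for frame, diff in zip(features, diffs)]
```

For example, `sparse_sample_indices(100, 4)` and `sparse_sample_indices(1000, 4)` both return four indices, so the cost of the later stages does not grow with video length.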

Key words: Action Recognition; Deep Learning; Temporal Enhanced Structure; Industrial Surveillance Video

Received: 16 May 2020

CLC number: TP391.4

Corresponding Author: FU Dongmei, Ph.D., professor. Her research interests include image processing and data mining.

About authors: ZHANG Haobo, master student. Her research interests include deep learning and video action recognition. ZHOU Ke, master, senior engineer. Her research interests include deep learning and image recognition.

Cite this article:

ZHANG Haobo, FU Dongmei, ZHOU Ke. Video-Based Temporal Enhanced Action Recognition[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(10): 951-958.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202010010 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2020/V33/I10/951
