Human Action Recognition Fusing Two-Stream Networks and SVM
TONG Anyang1,2, TANG Chao1,2, WANG Wenjian3
1. School of Artificial Intelligence and Big Data, Hefei University, Hefei 230601
2. Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, Hefei 230601
3. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
Abstract: The traditional two-stream convolutional neural network struggles to model long-duration motion, and its generalization ability degrades when long-range temporal information is lost. Therefore, a human action recognition method fusing a two-stream network with a support vector machine is proposed. Firstly, the RGB image of each video frame and the vertical component of its corresponding dense optical flow are extracted, capturing the spatial and temporal information of the actions in the video. These inputs are used to pre-train the spatial-stream and temporal-stream networks, respectively, and features are extracted after pre-training. Secondly, the equal-dimension feature vectors extracted by the two streams are fused in parallel to strengthen the representation ability of the feature vectors. Finally, the fused feature vectors are fed into a linear support vector machine for training and classification. Experimental results on standard public datasets show that the proposed method achieves good classification performance.
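For illustration, the following is a minimal sketch of the pipeline summarized above, assuming OpenCV's Farnebäck dense optical flow, off-the-shelf torchvision backbones as stand-ins for the two pre-trained streams, and scikit-learn's LinearSVC as the linear support vector machine. All names, the backbones, and the element-wise fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative only): two-stream feature extraction,
# parallel fusion, and linear SVM classification. Assumes OpenCV,
# PyTorch/torchvision, and scikit-learn; the backbones and fusion rule
# are stand-ins, not the authors' exact networks.
import cv2
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

def vertical_flow(prev_gray, curr_gray):
    """Farnebäck dense optical flow between two grayscale frames;
    only the vertical (y) component is kept, as in the abstract."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow[..., 1]

# Stand-ins for the pre-trained spatial and temporal streams: ResNet-18
# truncated before its classifier, so each stream emits a 512-d vector.
# (A real temporal stream would take stacked flow maps as input, with
# its first convolution adapted to the number of flow channels.)
spatial_net = torch.nn.Sequential(
    *list(models.resnet18(weights="DEFAULT").children())[:-1]).eval()
temporal_net = torch.nn.Sequential(
    *list(models.resnet18(weights="DEFAULT").children())[:-1]).eval()

def fused_features(rgb_batch, flow_batch):
    """Extract equal-dimension features from both streams and fuse them
    in parallel; an element-wise average stands in for the paper's rule."""
    with torch.no_grad():
        f_spatial = spatial_net(rgb_batch).flatten(1)     # (N, 512)
        f_temporal = temporal_net(flow_batch).flatten(1)  # (N, 512)
    return ((f_spatial + f_temporal) / 2).numpy()

# Linear SVM trained on the fused features; X_* are fused feature
# matrices and y_* the action labels.
clf = LinearSVC(C=1.0)
# clf.fit(fused_features(rgb_train, flow_train), y_train)
# print(clf.score(fused_features(rgb_test, flow_test), y_test))
```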