|
|
Multi-consistency Constrained Semi-supervised Video Action Detection Based on Feature Enhancement and Residual Reshaping
HU Zhengping1,2, ZHANG Qiming1, WANG Yulu1, ZHANG Hehao1, DI Jirui1
1. School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; 2. Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004
|
|
Abstract In consistency-regularized semi-supervised video action detection, the feature representations of the original data and the augmented data tend to induce a discriminative domain bias between the two types of data, resulting in inadequate fitting of the discriminative results. To address this issue, a multi-consistency constrained semi-supervised video action detection method based on feature enhancement and residual reshaping is proposed. Firstly, the basic action feature descriptors are progressively enhanced and encoded along the spatiotemporal dimension to capture the contextual information crucial for video action understanding. Then, a residual feature reshaping module extracts multi-scale residual information while reshaping the features. To reduce the discriminative bias between the two types of data, multiple consistency constraints are imposed on the original and augmented data from the perspectives of classification features and action localization features, so that the discriminative results and feature representations of the augmented data match those of the original data. Experimental results on the JHMDB-21 and UCF101-24 datasets demonstrate that the proposed method improves video action detection accuracy under limited labeled samples and is strongly competitive.
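The multi-consistency constraint described in the abstract can be illustrated with a minimal NumPy sketch. The function names, the use of mean-squared error as the discrepancy measure, and the weighting factors are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def consistency_loss(orig_feat, aug_feat):
    # Mean-squared discrepancy between the features of the
    # original data and the augmented data (an assumed metric).
    return float(np.mean((orig_feat - aug_feat) ** 2))

def multi_consistency(orig_cls, aug_cls, orig_loc, aug_loc,
                      w_cls=1.0, w_loc=1.0):
    # Combine consistency terms from the classification-feature
    # and action-localization-feature perspectives; the weights
    # w_cls and w_loc are hypothetical.
    return (w_cls * consistency_loss(orig_cls, aug_cls)
            + w_loc * consistency_loss(orig_loc, aug_loc))

# Toy example: identical features yield zero loss, so the
# augmented-data representation matches the original one.
f = np.ones((4, 8))
print(multi_consistency(f, f, f, f))
```

Minimizing such a combined loss on unlabeled videos pushes the network toward the same discriminative result for a clip and its augmented version, which is the intended effect of the multiple consistency constraints.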
|
Received: 03 April 2024
|
|
Fund: National Natural Science Foundation of China (No.61771420), Young Scientist Fund of the National Natural Science Foundation of China (No.62001413)
Corresponding Author:
HU Zhengping, Ph.D., professor. His research interests include pattern recognition and video processing.
|
About the authors: ZHANG Qiming, Master's student. His research interests include semi-supervised video action detection. WANG Yulu, Master's student. Her research interests include skeleton-based human action recognition. ZHANG Hehao, Ph.D. candidate. His research interests include 3D human pose estimation. DI Jirui, Ph.D. candidate. His research interests include fine-grained action recognition.
|
|
|
|
|
|