Temporal Action Unit Perception Based Open Set Action Recognition
YANG Kaixiang1, GAO Junyu2, FENG Yangbo1, XU Changsheng2
1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300382; 2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Abstract: In open set action recognition, a model must accurately classify the action categories seen during training and reject unknown actions that never appear in the training set. Most existing methods treat an action as a whole and ignore the fact that it can be decomposed into finer-grained action units. To address this issue, this paper proposes an open set action recognition method based on temporal action unit perception. First, an action unit relationship module is designed to learn fine-grained features of action units and thereby obtain the relational pattern between actions and action units. Unknown actions are identified by the different degrees to which known and unknown actions activate the action units. Second, an action unit temporal module is designed to model the temporal information of action units. The temporal characteristics of action units are exploited to further separate known actions from unknown actions that are visually similar and therefore easily confused. Finally, by jointly considering the relational patterns and the temporal information of action units, the model gains the ability to distinguish known actions from unknown ones. Experimental results on three action recognition datasets demonstrate the superior performance of the proposed method.
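As a rough illustration of the activation-based rejection idea described in the abstract, the following minimal sketch scores a video by how strongly its segment features activate a set of action unit vectors and rejects it as unknown when even the best-matching unit is weakly activated. It is not the authors' implementation: the function names, the use of cosine similarity, the max-pooling over segments, and the threshold value are all assumptions made for illustration only, and the temporal module is not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unit_activations(segments, units):
    """Hypothetical helper: for each action unit, take the highest
    cosine similarity reached by any segment of the video."""
    return [max(cosine(seg, unit) for seg in segments) for unit in units]


def is_unknown(segments, units, threshold=0.5):
    """Assumed rejection rule: a video is treated as unknown when its
    strongest action-unit activation stays below the threshold."""
    return max(unit_activations(segments, units)) < threshold


# Toy action units (assumed to have been learned from known classes).
units = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# A video whose segments align with the learned units.
known_video = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.0]]

# A video whose segments are orthogonal to every learned unit.
unknown_video = [[0.0, 0.0, 1.0]]

known_flag = is_unknown(known_video, units)
unknown_flag = is_unknown(unknown_video, units)
```

In this toy setup the first video activates both units strongly and is accepted, while the second activates none and is rejected. A real system would learn the unit vectors and calibrate the threshold on held-out data.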