1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China; 2. School of Software, Hefei University of Technology, Hefei 230601, China; 3. School of Computer Science and Technology, Anhui University, Hefei 230601, China
Abstract: Lipreading is a technology that translates a silent video of a single speaker's lip movements into text. Because the amplitude of lip movements is small, existing lipreading methods suffer from weak feature discrimination and poor generalization. To address this problem, the purification of lipreading visual features is studied along three dimensions, namely time, space and channel, and a lipreading method based on a multiple visual attention network (LipMVA) is proposed. First, channel attention adaptively calibrates channel-level features to mitigate the interference of meaningless channels. Then, two spatio-temporal attention modules of different granularities suppress the influence of unimportant pixels or frames. Experiments on the CMLR and GRID datasets show that LipMVA reduces the recognition error rate, thereby verifying its effectiveness.
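To make the two mechanisms described above concrete, the following is a minimal PyTorch sketch of the ideas in the abstract: a squeeze-and-excitation-style channel gate and a coarse spatio-temporal gate over frames and pixels. It assumes video features of shape (batch, channels, time, height, width); all module and parameter names are illustrative assumptions, not the authors' LipMVA implementation.

# Minimal sketch of the attention ideas in the abstract, assuming video
# features of shape (B, C, T, H, W). Names are illustrative assumptions;
# this is NOT the authors' LipMVA code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style gate: re-weights channels to damp uninformative ones."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool each channel over time and space to one descriptor,
        # then predict a per-channel weight in (0, 1).
        w = x.mean(dim=(2, 3, 4))                 # (B, C)
        w = self.fc(w)[:, :, None, None, None]    # (B, C, 1, 1, 1)
        return x * w


class SpatioTemporalAttention(nn.Module):
    """Coarse gate: one score per (t, h, w) location, so unimportant
    frames or pixels are down-weighted."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.score(x))          # (B, 1, T, H, W)
        return x * a


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 20, 20)        # dummy clip features
    feats = ChannelAttention(64)(feats)
    feats = SpatioTemporalAttention(64)(feats)
    print(feats.shape)                            # torch.Size([2, 64, 16, 20, 20])

The channel gate mirrors the "adaptive calibration of channel-level features" in the abstract; the spatio-temporal gate assigns a single attention score per frame-pixel location, which is the coarser of the two granularities the abstract mentions.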