|
|
Chinese Lipreading Network Based on Vision Transformer
XUE Feng1, HONG Zikun2, LI Shujie1, LI Yu2, XIE Yincen2
1. School of Software, Hefei University of Technology, Hefei 230601; 2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601
|
|
Abstract Lipreading is a multimodal task that converts lip-movement videos into text, aiming to understand the meaning expressed by a speaker in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips, but these networks capture only short-distance pixel relationships and therefore struggle to distinguish the lip shapes of similarly pronounced characters. To capture the long-distance relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the vision transformer (ViT) is proposed. By fusing ViT with the gated recurrent unit (GRU), the ability of the model to extract visual spatio-temporal features from lip videos is improved. Firstly, the global spatial features of lip images are extracted by the self-attention module of ViT. Then, GRU models the temporal sequence of frames. Finally, a cascaded attention-based sequence-to-sequence model predicts pinyin and Chinese character sequences. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate than existing methods.
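To make the described pipeline concrete, below is a minimal PyTorch sketch of the three stages the abstract outlines: a ViT front-end that extracts a global spatial feature per frame via self-attention, a GRU that models the frame sequence, and a cascade that predicts pinyin before Chinese characters. All module names, dimensions, and vocabulary sizes here are illustrative assumptions rather than the authors' configuration, and the attention-based sequence-to-sequence decoders are reduced to frame-synchronous heads for brevity.

import torch
import torch.nn as nn

class ViTFrameEncoder(nn.Module):
    """ViT-style encoder: patch embedding plus Transformer self-attention,
    returning one global feature per lip frame."""
    def __init__(self, img_size=112, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B*T, 3, H, W)
        p = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        cls = self.cls.expand(p.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, p], dim=1) + self.pos)
        return h[:, 0]                          # global [CLS] feature per frame

class LipReader(nn.Module):
    """ViT front-end + GRU temporal model + a two-stage cascade:
    pinyin logits first, then Chinese character logits."""
    def __init__(self, dim=256, n_pinyin=410, n_chars=3000):
        super().__init__()
        self.frontend = ViTFrameEncoder(dim=dim)
        self.temporal = nn.GRU(dim, dim, num_layers=2, batch_first=True)
        self.pinyin_head = nn.Linear(dim, n_pinyin)   # stage 1: pinyin
        self.char_rnn = nn.GRU(n_pinyin, dim, batch_first=True)
        self.char_head = nn.Linear(dim, n_chars)      # stage 2: characters

    def forward(self, video):                   # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.frontend(video.flatten(0, 1)).reshape(B, T, -1)
        seq, _ = self.temporal(feats)           # spatio-temporal features
        pinyin_logits = self.pinyin_head(seq)               # (B, T, n_pinyin)
        h, _ = self.char_rnn(pinyin_logits.softmax(-1))     # cascade: stage 1 feeds stage 2
        return pinyin_logits, self.char_head(h)             # (B, T, n_chars)

As a usage check, model = LipReader() followed by pinyin_logits, char_logits = model(torch.randn(2, 24, 3, 112, 112)) runs the two-stage prediction on a batch of two 24-frame lip clips; feeding the pinyin distribution into the character stage mirrors the pinyin-then-character prediction order described in the abstract.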
|
Received: 08 July 2022
|
|
Fund: National Natural Science Foundation of China (No.62272143), University Synergy Innovation Program of Anhui Province (No.GXXT-2022-054), Anhui Provincial Major Science and Technology Project (No.202203a05020025), The Seventh Special Support Plan for Innovation and Entrepreneurship in Anhui Province
Corresponding Author:
XUE Feng, Ph.D., professor. His research interests include artificial intelligence, multimedia analysis and recommendation systems.
|
About authors: HONG Zikun, master student. His research interests include computer vision. LI Shujie, Ph.D., lecturer. Her research interests include computer vision and human pose estimation. LI Yu, Ph.D. candidate. Her research interests include computer vision. XIE Yincen, master student. His research interests include computer vision.
|
|
|
|
|
|