模式识别与人工智能
Pattern Recognition and Artificial Intelligence  2022, Vol. 35 Issue (12): 1111-1121    DOI: 10.16451/j.cnki.issn1003-6059.202212006
Deep Learning Based Image Understanding and Its Applications
Chinese Lipreading Network Based on Vision Transformer
XUE Feng¹, HONG Zikun², LI Shujie¹, LI Yu², XIE Yincen²
1. School of Software, Hefei University of Technology, Hefei 230601;
2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601

Abstract  Lipreading is a multimodal task that converts videos of lip movements into text, with the aim of understanding what a speaker says in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips. These networks capture only short-distance pixel relationships, which makes it difficult to distinguish the lip shapes of similarly pronounced characters. To capture long-distance relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the vision transformer (ViT) is proposed. Fusing ViT with a gated recurrent unit (GRU) improves the model's ability to extract visual spatio-temporal features from lip videos. First, the global spatial features of lip images are extracted using the self-attention module of ViT. Then, a GRU is employed to model the temporal sequence of frames. Finally, a cascaded sequence-to-sequence model based on the attention mechanism is used to predict the Chinese pinyin and Chinese character sequences of the utterance. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model produces a lower Chinese character error rate.
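The character error rate mentioned in the abstract is conventionally computed as the edit distance between the predicted and reference character sequences, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation script. The function names `edit_distance` and `character_error_rate` and the sample sentences are illustrative assumptions, and the exact protocol used for the CMLR experiments is not stated on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, deletions, and
    insertions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match, or substitute r with h
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: errors are pooled over all sentences before dividing,
    which is one common convention; per-sentence averaging is another.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_chars


# Hypothetical sentences, made up for illustration (not taken from CMLR).
references = ["今天天气很好", "我们去学校"]
hypotheses = ["今天天汽很好", "我们学校"]
cer = character_error_rate(references, hypotheses)
print(f"CER = {cer:.4f}")
```

In this made-up example the first hypothesis contains one substituted character and the second drops one character, so two edits are counted against eleven reference characters. Because Python strings are indexed by code point, each Chinese character counts as one unit without any extra tokenization.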
Key words: Lipreading, Vision Transformer (ViT), Deep Neural Network, Encoder-Decoder, Attention Mechanism, Feature Extraction
Received: 08 July 2022     
ZTFLH: TP391.41  
Fund: National Natural Science Foundation of China (No. 62272143), University Synergy Innovation Program of Anhui Province (No. GXXT-2022-054), Anhui Provincial Major Science and Technology Project (No. 202203a05020025), The Seventh Special Support Plan for Innovation and Entrepreneurship in Anhui Province
Corresponding Author: XUE Feng, Ph.D., professor. His research interests include artificial intelligence, multimedia analysis and recommendation systems.
About authors: HONG Zikun, master student. His research interests include computer vision. LI Shujie, Ph.D., lecturer. Her research interests include computer vision and human pose estimation. LI Yu, Ph.D. candidate. Her research interests include computer vision. XIE Yincen, master student. His research interests include computer vision.
Cite this article:
XUE Feng, HONG Zikun, LI Shujie, et al. Chinese Lipreading Network Based on Vision Transformer[J]. Pattern Recognition and Artificial Intelligence, 2022, 35(12): 1111-1121.
URL:  
http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202212006      OR     http://manu46.magtech.com.cn/Jweb_prai/EN/Y2022/V35/I12/1111
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No.350 Shushanhu Road, Hefei, Anhui Province, P.R. China Tel: 0551-65591176 Fax:0551-65591176 Email: bjb@iim.ac.cn