Speech Driven Talking Face Video Generation via Landmarks Representation

doi:10.16451/j.cnki.issn1003-6059.202106009



Abstract: Existing speech-driven talking face video generation methods ignore the speaker's head motion. To address this problem, a speech-driven talking face video generation method based on facial landmark representation is proposed. The speaker's head motion and lip motion are represented by facial contour landmarks and lip landmarks, respectively. Speech is converted to facial landmarks through a parallel multi-branch network. The final talking face video is synthesized from the continuous lip landmark sequence, the head landmark sequence, and a template image. Quantitative and qualitative experiments are conducted. The results show that the talking face video with head motion synthesized by the proposed method is clear and natural, and that the method performs better than the compared methods.
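
The parallel multi-branch mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the two-layer branches, the random weights, the 13-dimensional audio features, and the landmark counts (20 lip points and 17 contour points, as in the common 68-point annotation scheme) are all assumptions chosen for illustration. The sketch only shows the data flow, in which the same per-frame audio features feed two independent branches, one producing lip landmarks and one producing facial contour (head) landmarks.

```python
import math
import random

N_LIP = 20      # assumed number of lip landmarks (68-point convention)
N_CONTOUR = 17  # assumed number of facial contour landmarks


def init_branch(rng, in_dim, hidden, n_points):
    """Create random weights for one hypothetical two-layer branch."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(in_dim, hidden), "w2": mat(hidden, n_points * 2)}


def matvec(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def run_branch(params, audio_feats):
    """Map T audio feature vectors to T frames of (x, y) landmarks."""
    frames = []
    for feat in audio_feats:
        hidden = [math.tanh(x) for x in matvec(feat, params["w1"])]
        flat = matvec(hidden, params["w2"])
        frames.append([(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)])
    return frames


def speech_to_landmarks(audio_feats, lip_branch, head_branch):
    """Feed the same audio features to both branches independently."""
    lip_seq = run_branch(lip_branch, audio_feats)
    head_seq = run_branch(head_branch, audio_feats)
    return lip_seq, head_seq


rng = random.Random(0)
T, D = 25, 13  # assumed: 25 video frames, 13-dim audio features per frame
audio = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
lip_branch = init_branch(rng, D, 32, N_LIP)
head_branch = init_branch(rng, D, 32, N_CONTOUR)
lip_seq, head_seq = speech_to_landmarks(audio, lip_branch, head_branch)
print(len(lip_seq), len(lip_seq[0]), len(head_seq[0]))  # 25 20 17
```

In a full pipeline, the two landmark sequences would then be passed, together with a template image, to an image synthesis stage; that stage is omitted here because the abstract gives no detail about it.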

Key words: Talking Face; Facial Landmark; Lip Action; Head Action; Face Video

Received: 03 March 2021

CLC number: TP 391.4

Fund: University Synergy Innovation Program of Anhui Province (No. GXXT-2019-007), National Natural Science Foundation of China (No. 61902104), Natural Science Foundation of Anhui Province (No. 2008085QF295), University Natural Science Research Project of Anhui Province (No. KJ2020A0651), Talent Research Foundation of Hefei University (No. 18-19RC54)

Corresponding Author: NIAN Fudong, Ph.D., associate professor. His research interests include computer vision and multimedia computing.

About authors: WANG Wentao, master student. His research interests include image generation.
WANG Yan, Ph.D. candidate. Her research interests include convolutional neural networks and multimodal fusion.
ZHANG Jingjing, Ph.D., associate professor. Her research interests include computer vision.
HU Guiheng, master, lecturer. His research interests include software technology and artificial intelligence.
LI Teng, Ph.D., professor. His research interests include computer vision and multimedia computing.


Cite this article:

NIAN Fudong, WANG Wentao, WANG Yan, et al. Speech Driven Talking Face Video Generation via Landmarks Representation[J]. , 2021, 34(6): 572-580.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202106009 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2021/V34/I6/572
