Minimum Generation Error Training Method for Speech Synthesis Models Based on Perceptually Weighted Line Spectral Pair Distance


Abstract A Minimum Generation Error (MGE) training method based on a perceptually weighted Line Spectral Pair (LSP) distance is proposed to improve the performance of Hidden Markov Model (HMM) based parametric speech synthesis systems. When the speech spectrum is represented by LSPs, the Euclidean-distance generation error used in traditional MGE training does not adequately measure the actual gap between the generated and natural spectra. Defining the generation error by Log Spectral Distortion (LSD), which is independent of the choice of spectral parameters, addresses this problem, but the improvement is small relative to the added computational complexity. In this paper, an MGE training criterion based on a weighted LSP distance is proposed, and the resulting training method is compared, both subjectively and objectively, across different weighting methods and against LSD-based MGE training. The outcome is a perceptually weighted training method that achieves the best performance among the methods compared and incurs no extra computational complexity relative to traditional MGE training.
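The contrast between a plain Euclidean generation error and a weighted LSP distance can be illustrated with a minimal sketch. The abstract does not state which weighting the authors finally adopt, so the inverse harmonic mean weighting below (a common choice in LSP quantization) is an assumption, and all function names are hypothetical. LSPs are assumed to be ordered frequencies in radians, strictly between 0 and pi.

```python
import numpy as np


def inverse_harmonic_mean_weights(lsp):
    """Assumed weighting (not taken from the paper): each LSP coefficient gets
    a larger weight when its neighbours are close, i.e. near spectral peaks.

    lsp: array of shape (T, p), ordered LSPs in radians within (0, pi).
    """
    lsp = np.atleast_2d(np.asarray(lsp, dtype=float))
    num_frames = lsp.shape[0]
    # Pad with the boundaries 0 and pi so the first and last LSP
    # also have two neighbours.
    padded = np.hstack([
        np.zeros((num_frames, 1)),
        lsp,
        np.full((num_frames, 1), np.pi),
    ])
    gaps = np.diff(padded, axis=1)  # shape (T, p + 1), positive for ordered LSPs
    return 1.0 / gaps[:, :-1] + 1.0 / gaps[:, 1:]


def euclidean_generation_error(natural, generated):
    """Unweighted squared distance between natural and generated LSP frames."""
    natural = np.atleast_2d(np.asarray(natural, dtype=float))
    generated = np.atleast_2d(np.asarray(generated, dtype=float))
    return float(np.sum((natural - generated) ** 2))


def weighted_generation_error(natural, generated):
    """Squared distance with per-coefficient weights computed from the
    natural frames only, so the weights can be precomputed before training."""
    natural = np.atleast_2d(np.asarray(natural, dtype=float))
    generated = np.atleast_2d(np.asarray(generated, dtype=float))
    weights = inverse_harmonic_mean_weights(natural)
    return float(np.sum(weights * (natural - generated) ** 2))


# One natural frame with two closely spaced LSPs (0.5 and 0.6).
natural = np.array([[0.5, 0.6, 2.0]])
# The same absolute error placed on a closely spaced LSP or on an isolated one.
error_near_peak = np.array([[0.5, 0.65, 2.0]])
error_isolated = np.array([[0.5, 0.6, 2.05]])

print(euclidean_generation_error(natural, error_near_peak),
      euclidean_generation_error(natural, error_isolated))
print(weighted_generation_error(natural, error_near_peak),
      weighted_generation_error(natural, error_isolated))
```

In this sketch the Euclidean error treats both deviations alike, while the weighted error penalizes the deviation next to the closely spaced pair more heavily. Because the assumed weights depend only on the natural frames, the error stays quadratic in the generated parameters, which is in line with the abstract's remark that the weighted criterion adds no computational complexity over traditional MGE training.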

Key words: Speech Synthesis; Hidden Markov Model (HMM); Minimum Generation Error (MGE); Perceptual Weighting; Line Spectral Pair (LSP) Parameter

Received: 07 February 2009

ZTFLH: TN912.33

Authors: LEI Ming, LING Zhen-Hua, DAI Li-Rong

Cite this article:

LEI Ming, LING Zhen-Hua, DAI Li-Rong. Minimum Generation Error Training Based on Perceptually Weighted Line Spectral Pair Distance for Statistical Parametric Speech Synthesis[J]. , 2010, 23(4): 572-579.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/ OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2010/V23/I4/572
