Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201508001


Abstract: A speech synthesis method based on restricted Boltzmann machine (RBM) spectrum modeling and unit selection is proposed. In the model training stage, RBMs are used to model detail-rich spectral features, such as spectral envelopes and short-time amplitude spectra. This replaces the conventional approach, which models mel-cepstral features with single Gaussian distributions with diagonal covariance, and improves the ability of the acoustic model to describe spectral features. In the synthesis stage, the trained RBMs are used to compute the log-likelihoods of the spectral features of candidate samples, and a piecewise linear mapping of these values is used to construct the target cost function for unit selection. Experimental results show that the proposed method effectively improves the naturalness of synthetic speech.

Key words: Speech Synthesis, Unit Selection, Hidden Markov Model, Restricted Boltzmann Machine
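The synthesis-stage scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian-Bernoulli RBM with unit visible variance, scores candidates by the unnormalized log-likelihood (the negative free energy, which omits the constant log-partition term), and uses hypothetical breakpoints `knots_x` and `knots_y` for the piecewise linear mapping. The function and parameter names are illustrative only.

```python
import numpy as np


def rbm_unnormalized_log_likelihood(v, W, b, c):
    """Negative free energy of a Gaussian-Bernoulli RBM (unit visible variance).

    v: (n_frames, n_visible) spectral feature vectors
    W: (n_visible, n_hidden) weights, b: visible biases, c: hidden biases
    The result equals log p(v) up to the constant log-partition term,
    which is an assumption made here to keep the sketch tractable.
    """
    visible_term = -0.5 * np.sum((v - b) ** 2, axis=1)
    # softplus(x) = log(1 + exp(x)), computed stably with logaddexp
    hidden_term = np.sum(np.logaddexp(0.0, v @ W + c), axis=1)
    return visible_term + hidden_term


def piecewise_linear_cost(log_lik, knots_x, knots_y):
    """Map log-likelihoods to target costs via hypothetical breakpoints.

    knots_x must be increasing; values outside the range are clamped.
    """
    return np.interp(log_lik, knots_x, knots_y)


# Toy usage: one 2-dimensional frame, three hidden units, all-zero parameters.
v = np.array([[1.0, 0.0]])
W = np.zeros((2, 3))
b = np.zeros(2)
c = np.zeros(3)
score = rbm_unnormalized_log_likelihood(v, W, b, c)

# Hypothetical mapping: higher likelihood gives lower target cost.
cost = piecewise_linear_cost(score, [-10.0, 0.0, 10.0], [5.0, 1.0, 0.0])
```

Because the log-partition term is the same for every candidate scored by the same RBM, dropping it does not change the ranking of candidates under that model; whether the paper handles the partition function this way is not stated in the abstract.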

Received: 2014-04-25

CLC number: TN 912.33

Funding: Supported by the National Natural Science Foundation of China (No. 61273032)

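About the authors: SONG Yang (宋阳), male, born in 1989, master's student. His main research interest is speech synthesis. E-mail: ysong@mail.ustc.edu.cn. LING Zhen-Hua (凌震华, corresponding author), male, born in 1979, Ph.D., associate professor. His main research interests are speech synthesis and voice conversion. E-mail: zhling@ustc.edu.cn. DAI Li-Rong (戴礼荣), male, born in 1962, Ph.D., professor. His main research interests are speech information processing and human-machine speech communication.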

Cite this article:

SONG Yang, LING Zhen-Hua, DAI Li-Rong. Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2015, 28(8): 673-679.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201508001 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2015/V28/I8/673
