Abstract: A spectrum modeling and unit selection speech synthesis method based on restricted Boltzmann machines (RBMs) is proposed. At the model training stage, an RBM is used to model detail-rich spectral features, such as spectral envelopes and short-time spectral amplitudes, in place of the traditional spectral model, which is a single Gaussian with a diagonal covariance matrix over mel-cepstral features. This improves the acoustic model's ability to describe spectral features. At the speech synthesis stage, the RBM is used to calculate the log-likelihoods of the spectral features of candidate samples, and a piecewise linear mapping is proposed to construct the target cost function for unit selection. Experimental results indicate that the proposed method effectively improves the naturalness of synthetic speech.
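As an illustration of the two synthesis-stage steps named in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian-Bernoulli RBM with unit-variance visible units, scores a spectral frame by its negative free energy (the log-likelihood up to the constant log partition function, which is omitted here), and assumes the piecewise linear mapping is applied to that score. The function names and the knot values passed to `piecewise_linear` are hypothetical; the abstract does not specify them.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rbm_unnormalized_log_likelihood(v, W, a, b):
    """Negative free energy -F(v) of a Gaussian-Bernoulli RBM.

    Assumes unit-variance visible units (an assumption of this sketch).
    v: visible vector (length V), W: V x H weight matrix (list of rows),
    a: visible biases (length V), b: hidden biases (length H).
    log p(v) equals this value minus the constant log partition function.
    """
    quadratic = sum((vi - ai) ** 2 for vi, ai in zip(v, a)) / 2.0
    hidden = 0.0
    for j, bj in enumerate(b):
        activation = bj + sum(vi * W[i][j] for i, vi in enumerate(v))
        hidden += softplus(activation)
    return hidden - quadratic


def piecewise_linear(x, knots):
    """Map x through straight segments joining (x, y) knots.

    knots must be sorted by strictly increasing x; values outside the
    knot range are clamped to the first or last y.
    """
    if x <= knots[0][0]:
        return knots[0][1]
    if x >= knots[-1][0]:
        return knots[-1][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical usage: higher likelihood gives lower target cost.
W = [[0.0, 0.0], [0.0, 0.0]]
score = rbm_unnormalized_log_likelihood([0.5, -0.5], W, [0.5, -0.5], [0.0, 0.0])
cost = piecewise_linear(score, [(-10.0, 1.0), (0.0, 0.5), (10.0, 0.0)])
```

Clamping outside the knot range keeps the cost bounded for outlier candidates, which is one plausible reason to prefer a piecewise linear mapping over using the raw log-likelihood directly.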