Abstract: A unit selection algorithm for speech synthesis based on statistical acoustic models is proposed. In the training stage, acoustic models for context-dependent phonemes are built from acoustic features extracted from the training data, such as spectral parameters and F0, together with the segmental and prosodic labels of the corpus. The hidden Markov model (HMM) is adopted as the model structure. In the synthesis stage, the optimal phoneme unit sequence is searched for in the speech corpus by maximizing the likelihood of its acoustic features given the sentence HMM constructed from the contextual information of the input text. Finally, the waveforms of the selected candidate units are concatenated and smoothed to produce the synthesized speech. Based on the proposed method, a Chinese speech synthesis system using initials and finals as the basic concatenation units is constructed. Listening test results show that the proposed method achieves better naturalness of synthesized speech than the conventional method.
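The synthesis-stage search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each target unit is scored by a single diagonal-covariance Gaussian standing in for the context-dependent HMM, each corpus candidate is reduced to one feature vector, and a squared-distance join penalty between adjacent units (weighted by the hypothetical parameter `join_weight`) is added so that the search requires dynamic programming. The names `gaussian_loglik`, `join_penalty`, and `select_units` are illustrative only.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian.

    Assumption: this single Gaussian is a stand-in for a full HMM.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def join_penalty(a, b):
    """Hypothetical smoothness term: squared distance between adjacent units."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def select_units(targets, candidates, join_weight=1.0):
    """Choose one candidate per target position by Viterbi-style search.

    targets:    list of (mean, var) pairs, one simplified model per unit.
    candidates: candidates[t] is a list of feature vectors usable at t.
    Returns (best_score, list of chosen candidate indices).
    """
    mean, var = targets[0]
    scores = [gaussian_loglik(c, mean, var) for c in candidates[0]]
    backptrs = []
    for t in range(1, len(targets)):
        mean, var = targets[t]
        prev = candidates[t - 1]
        new_scores, ptrs = [], []
        for c in candidates[t]:
            # Score of reaching candidate c from each previous candidate.
            trans = [
                scores[j] - join_weight * join_penalty(prev[j], c)
                for j in range(len(prev))
            ]
            j_best = max(range(len(prev)), key=trans.__getitem__)
            new_scores.append(trans[j_best] + gaussian_loglik(c, mean, var))
            ptrs.append(j_best)
        scores = new_scores
        backptrs.append(ptrs)
    # Trace the best path back from the final position.
    best = max(range(len(scores)), key=scores.__getitem__)
    path = [best]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return scores[best], path


# Toy data: two target units, two one-dimensional candidates each.
targets = [([0.0], [1.0]), ([1.0], [1.0])]
candidates = [[[0.1], [2.0]], [[5.0], [0.9]]]
score, path = select_units(targets, candidates)
print(path)  # [0, 1]
```

With `join_weight=0.0` the search reduces to picking, at each position, the candidate with the highest likelihood under its target model. A nonzero weight trades that likelihood against continuity between neighbouring units.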
凌震华,王仁华. 基于统计声学模型的单元挑选语音合成算法*[J]. 模式识别与人工智能, 2008, 21(3): 280-284.
LING Zhen-Hua, WANG Ren-Hua. Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis. Pattern Recognition and Artificial Intelligence, 2008, 21(3): 280-284.