Abstract: To improve decision tree clustering and to avoid over-training or under-training of the clustered models, a minimum generation error criterion combined with cross-validation (CV) is introduced to optimize the minimum description length (MDL) penalty factor. The CV-based generation error is used to select the scale of the decision tree. Results of both subjective and objective tests show that speech synthesized by the proposed method outperforms that of the baseline system in both quality and naturalness.
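The abstract describes selecting the MDL penalty factor by minimizing a cross-validated generation error. The following is a minimal sketch of that selection loop, not the authors' implementation: `train_fn` (decision-tree clustering with a given MDL factor) and `error_fn` (generation error between synthesized and natural parameter trajectories) are hypothetical callables standing in for the actual training and evaluation steps.

```python
import numpy as np


def select_mdl_factor(utterances, candidate_factors, train_fn, error_fn, k_folds=5):
    """Pick the MDL factor with the smallest K-fold average generation error.

    train_fn(train_utts, mdl_factor) -> clustered models   (hypothetical helper)
    error_fn(models, utterance)      -> generation error   (hypothetical helper)
    """
    folds = np.array_split(np.arange(len(utterances)), k_folds)
    best_factor, best_error = None, float("inf")

    for factor in candidate_factors:
        fold_errors = []
        for held_out in folds:
            held = set(held_out.tolist())
            train_utts = [u for i, u in enumerate(utterances) if i not in held]
            # Cluster context-dependent models on the training folds with this MDL factor.
            models = train_fn(train_utts, factor)
            # Generation error on the held-out fold.
            errs = [error_fn(models, utterances[i]) for i in held_out]
            fold_errors.append(float(np.mean(errs)))
        cv_error = float(np.mean(fold_errors))  # CV-based generation error for this factor
        if cv_error < best_error:
            best_factor, best_error = factor, cv_error

    return best_factor
```

Under these assumptions, a smaller MDL factor grows a larger tree (risking over-training) and a larger factor prunes it harder (risking under-training); the CV generation error provides the criterion for choosing between the candidate factors.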