Abstract: Cross-language model adaptation in statistical parametric speech synthesis is used to rapidly construct a text-to-speech (TTS) system with the target speaker's characteristics when the source and target speakers' languages differ. In this paper, the conventional cross-language adaptation method based on phone mapping and triphone models is improved in two ways. First, phone mapping is combined with data selection to improve its reliability. Second, cross-language prosodic information mapping is introduced to make use of prosodic information, which the triphone model ignores. Experiments on Chinese-to-English adaptation show that speech synthesized with the improved method has much better naturalness and speaker similarity than speech synthesized with the conventional method.