Abstract: With the development of hidden Markov model (HMM) based speech synthesis, impostors can easily produce synthetic speech that carries a specific speaker's characteristics, which poses a serious threat to existing speaker recognition systems. This paper investigates the difference between natural and synthetic speech in the real part of the cepstrum and proposes a speaker recognition system that is robust against synthetic speech. Experimental results show that the false accept rate (FAR) for synthetic speech is zero in the proposed system, compared with 99.2% in the existing speaker recognition system, while the equal error rate (EER) for natural speech remains unchanged.
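The FAR and EER figures quoted in the abstract are standard speaker verification metrics. As a minimal sketch (not the authors' evaluation code), assuming each trial yields a scalar score and a trial is accepted when its score is at or above a threshold, they could be computed as follows. The function names and the toy scores are hypothetical and serve only as illustration.

```python
import numpy as np


def false_accept_rate(impostor_scores, threshold):
    """Fraction of impostor trials accepted (score >= threshold)."""
    scores = np.asarray(impostor_scores, dtype=float)
    return float(np.mean(scores >= threshold))


def false_reject_rate(genuine_scores, threshold):
    """Fraction of genuine trials rejected (score < threshold)."""
    scores = np.asarray(genuine_scores, dtype=float)
    return float(np.mean(scores < threshold))


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by scanning the observed scores as thresholds.

    Returns (eer, threshold), where the threshold is the candidate at which
    FAR and FRR are closest and eer is the mean of the two rates there.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer, best_threshold = None, None, None
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = false_accept_rate(impostor, t)
        frr = false_reject_rate(genuine, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, float(t)
    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.6, 0.3, 0.2, 0.1]
    print("FAR at 0.5:", false_accept_rate(impostor, 0.5))
    print("FRR at 0.5:", false_reject_rate(genuine, 0.5))
    print("EER, threshold:", equal_error_rate(genuine, impostor))
```

Under this convention, a FAR of 99.2% for synthetic speech would mean that nearly every synthetic impostor trial scored above the system's operating threshold, while a FAR of zero would mean none did.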