Abstract: Speech uttered with various emotions degrades the performance of speaker recognition systems. A pitch-dependent affective speech clustering method for speaker modeling is proposed, aiming to exploit affective material effectively in speaker recognition. Pitch thresholds are determined separately for male and female speakers, and the cepstral features falling within the same pitch range are clustered together. For each speaker, pitch-dependent models are built from the corresponding feature clusters by MAP adaptation. At identification time, the maximum likelihood rule is applied over the matched pitch-dependent models to decide the speaker's identity. The proposed method is evaluated on the Mandarin Affective Speech Corpus (MASC). Experimental results show that it outperforms both the cepstral-feature-based method and the structure training method for speaker recognition.
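To make the pipeline concrete, the following is a minimal Python sketch of the three steps the abstract names: clustering frame-level cepstral features by pitch range, building pitch-dependent speaker models by MAP adaptation of a background GMM, and identifying a speaker by the maximum likelihood rule over the matched models. It assumes frame-level MFCCs and F0 values have already been extracted and a gender-matched universal background model (UBM) has been trained beforehand. The pitch boundaries, the relevance factor, and the helper names (cluster_by_pitch, map_adapt, enroll, identify) are illustrative assumptions, not the authors' implementation; only the Gaussian means are adapted, which is the common simplification of Gauvain-Lee MAP adaptation.

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical pitch-range boundaries in Hz; the paper derives its own
# thresholds separately for male and female speakers. Unvoiced frames
# (F0 = 0) fall into the lowest range in this sketch.
PITCH_BINS = {"male": [0, 120, 180, np.inf], "female": [0, 200, 280, np.inf]}

def cluster_by_pitch(mfcc, f0, gender):
    # Group frame-level cepstral features by the pitch range of each frame.
    edges = PITCH_BINS[gender]
    clusters = [[] for _ in range(len(edges) - 1)]
    for frame, pitch in zip(mfcc, f0):
        idx = int(np.searchsorted(edges, pitch, side="right")) - 1
        clusters[idx].append(frame)
    return [np.asarray(c) for c in clusters]

def map_adapt(ubm, frames, relevance=16.0):
    # MAP adaptation of the UBM means; weights and covariances are
    # copied unchanged from the UBM for simplicity.
    post = ubm.predict_proba(frames)              # (T, M) responsibilities
    n_k = post.sum(axis=0) + 1e-10                # soft frame counts per mixture
    ex_k = post.T @ frames / n_k[:, None]         # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]    # data-dependent adaptation weight
    model = GaussianMixture(n_components=ubm.n_components,
                            covariance_type=ubm.covariance_type)
    model.weights_ = ubm.weights_
    model.covariances_ = ubm.covariances_
    model.precisions_cholesky_ = ubm.precisions_cholesky_
    model.means_ = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return model

def enroll(ubm, mfcc, f0, gender):
    # One pitch-dependent model per pitch range for a single speaker;
    # ranges with no enrollment frames get no model.
    return [map_adapt(ubm, c) if len(c) else None
            for c in cluster_by_pitch(mfcc, f0, gender)]

def identify(speaker_models, mfcc, f0, gender):
    # Maximum likelihood decision: score each test cluster against the
    # matched pitch-dependent model and sum log-likelihoods per speaker.
    clusters = cluster_by_pitch(mfcc, f0, gender)
    def score(models):
        return sum(m.score_samples(c).sum()
                   for m, c in zip(models, clusters)
                   if m is not None and len(c))
    return max(speaker_models, key=lambda spk: score(speaker_models[spk]))

In use, the UBM would be fitted once on background data, e.g. GaussianMixture(n_components=64, covariance_type="diag").fit(background_frames), then enroll() would be called per speaker and identify() per test utterance; the matched-model scoring is what distinguishes this from scoring all frames against a single emotion-blind speaker model.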