Deep Belief Network Based Speaker Information Extraction Method

CHEN Li-Ping1, WANG Er-Yu2, DAI Li-Rong1, SONG Yan1

1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027
2. Tencent, Inc., Beijing 100080
Abstract In i-vector based speaker verification systems, discriminative speaker information must be extracted from the i-vectors to further improve performance. In this paper, a deep belief network (DBN) based speaker-related information extraction method, combined with the anchor model, is proposed. By analyzing and modeling the complex variabilities contained in i-vectors layer by layer, speaker-related information is extracted through non-linear transformations. Experimental results on the core test of NIST SRE 2008 show the superiority of the proposed method. Compared with the linear discriminant analysis (LDA) based system, the equal error rates (EER) of male and female trials are reduced to 4.96% and 6.18% respectively. Furthermore, after fusing the proposed method with LDA, the EERs are further reduced to 4.74% and 5.35%.
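The layer-by-layer modeling the abstract describes can be illustrated as greedy layer-wise RBM pretraining followed by a deterministic forward pass that maps i-vectors to a higher-level representation. The toy sketch below uses simplified Bernoulli-Bernoulli RBMs trained with one-step contrastive divergence on synthetic data scaled to [0, 1]; it is a minimal illustration only, not the paper's system: real-valued i-vectors would call for a Gaussian visible layer, and the anchor-model supervision and fine-tuning described in the abstract are not shown. All dimensions and function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.05, rng=None):
    """Train one Bernoulli-Bernoulli RBM with CD-1; returns (W, hidden bias)."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden bias
    c = np.zeros(n_visible)  # visible bias
    for _ in range(epochs):
        # Positive phase: hidden activations given the data.
        h_prob = sigmoid(data @ W + b)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one-step reconstruction (CD-1).
        v_recon = sigmoid(h_sample @ W.T + c)
        h_recon = sigmoid(v_recon @ W + b)
        # Approximate gradient update.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b += lr * (h_prob - h_recon).mean(axis=0)
        c += lr * (data - v_recon).mean(axis=0)
    return W, b

def dbn_transform(ivectors, layer_sizes):
    """Greedy layer-wise pretraining; each layer's output feeds the next.
    Returns the top-layer (speaker-related) representation."""
    x = ivectors
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        x = sigmoid(x @ W + b)  # deterministic forward pass to the next layer
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    ivecs = rng.random((100, 400))          # 100 toy "i-vectors", dim 400
    emb = dbn_transform(ivecs, [256, 128])  # two stacked hidden layers
    print(emb.shape)                        # (100, 128)
```

The key property this sketch captures is that each RBM is trained only on the output of the layer beneath it, so the variabilities in the i-vectors are modeled one layer at a time, and the final mapping from i-vector to embedding is a composition of non-linear transforms.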
Received: 03 December 2012
|
|
|
|
|
|
|