Abstract: Existing speaker clustering methods based on the Gaussian mixture model (GMM) mainly obtain each cluster's GMM by adapting a universal background model (UBM). However, this adaptation suffers from a lack of data per cluster and yields poor models. In this paper, two factor analysis modeling methods are explored, based on eigenvoice (EV) space analysis and total variability (TV) space analysis, respectively. By modeling the variability space, both methods greatly reduce the number of parameters that must be estimated for each cluster's GMM. Experimental results on the two-speaker telephone data of the 2008 NIST Speaker Recognition Evaluation show that both proposed methods achieve a considerable reduction in speaker error rate compared with the baseline system using maximum a posteriori (MAP) adaptation, and that the method based on TV space analysis obtains a lower speaker error rate than the method based on EV space analysis.
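The parameter reduction described in the abstract can be illustrated with a simple count. The following is a minimal sketch, not the authors' implementation. It assumes mean-only MAP adaptation and a pre-trained low-rank subspace in which a cluster's mean supervector is the UBM supervector plus a subspace matrix times a low-dimensional factor vector. The function names and the sizes (512 components, 39-dimensional features, rank 100) are hypothetical and do not come from the paper.

```python
def map_parameters_per_cluster(num_components: int, feat_dim: int) -> int:
    """Parameters re-estimated per cluster under mean-only MAP adaptation.

    Assumption: only the component means are adapted, so every one of the
    num_components * feat_dim mean coordinates is estimated from cluster data.
    """
    return num_components * feat_dim


def subspace_parameters_per_cluster(rank: int) -> int:
    """Parameters estimated per cluster with a pre-trained EV or TV subspace.

    Assumption: the subspace matrix is trained beforehand on other data, so
    only the rank-dimensional factor vector is estimated for each cluster.
    """
    return rank


# Hypothetical sizes, chosen only for illustration.
NUM_COMPONENTS = 512
FEAT_DIM = 39
RANK = 100

map_count = map_parameters_per_cluster(NUM_COMPONENTS, FEAT_DIM)
subspace_count = subspace_parameters_per_cluster(RANK)

print(f"MAP adaptation: {map_count} parameters per cluster")
print(f"Subspace model: {subspace_count} parameters per cluster")
```

Under these assumed sizes, the subspace model estimates about two orders of magnitude fewer values per cluster, which is why it can be more robust when a cluster contains only a few seconds of speech.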