Speech Recognition Based on Deep Neural Networks on Tibetan Corpus
YUAN Sheng-Long, GUO Wu, DAI Li-Rong
National Engineering Laboratory for Speech and Language Information Processing, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027
Abstract: This paper presents a first study of large vocabulary continuous speech recognition for conversational telephone speech in Tibetan. Because Tibetan is a minority language, the major difficulty in Tibetan speech recognition is the scarcity of training data. The acoustic model for Tibetan is trained with deep neural networks (DNN). To address the data scarcity, DNN models trained on other, majority languages are used as the initial networks for the target Tibetan DNN model. In addition, expert-designed phonetic questions for Tibetan are unavailable because of limited phonetic knowledge of the language. To reduce the number of tri-phone hidden Markov models (HMM) in Tibetan speech recognition, phonetic questions generated automatically in a data-driven manner are used to tie the tri-phone HMMs. Different clusterings of tri-phone states are tested, and the Gaussian mixture model (GMM) system reaches a word accuracy of about 30.86% on the test corpus. For DNN-based acoustic modeling, three DNN models trained on different large corpora are adopted. The experimental results show that the proposed methods improve recognition performance, with a word accuracy of about 43.26% on the test corpus.
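The idea of using a majority-language DNN as the initial network can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that the hidden layers are copied from the source network and that only the output layer is re-initialized to match the Tibetan tied states. The layer sizes, the state counts, and the function names `init_layers`, `transfer_hidden_layers`, and `forward` are hypothetical, and the randomly initialized source network merely stands in for a trained one.

```python
import numpy as np

def init_layers(sizes, rng):
    """Randomly initialise a feed-forward net as a list of (W, b) pairs."""
    return [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def transfer_hidden_layers(source_layers, n_target_states, rng):
    """Copy the hidden layers of a source-language DNN and attach a new
    output layer sized for the target language (assumed transfer scheme)."""
    hidden = [(W.copy(), b.copy()) for W, b in source_layers[:-1]]
    last_hidden_dim = source_layers[-1][0].shape[0]
    W_out = rng.normal(0.0, 0.01, size=(last_hidden_dim, n_target_states))
    b_out = np.zeros(n_target_states)
    return hidden + [(W_out, b_out)]

def forward(layers, x):
    """Sigmoid hidden layers followed by a softmax over tied states."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    W, b = layers[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Stand-in for a DNN trained on a majority language (sizes are illustrative).
source_dnn = init_layers([39, 64, 64, 500], rng)
# Target network: shared hidden layers, new output layer of 120 tied states.
tibetan_dnn = transfer_hidden_layers(source_dnn, n_target_states=120, rng=rng)
# State posteriors for a batch of 4 feature frames.
posteriors = forward(tibetan_dnn, rng.normal(size=(4, 39)))
```

In such a scheme the target network would then be fine-tuned on the limited Tibetan data, so that only the new output layer has to be learned from scratch.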