Abstract: End-to-end speech recognition based on connectionist temporal classification (CTC) is applied to Tibetan automatic speech recognition (ASR), and its performance is better than that of the state-of-the-art bidirectional long short-term memory approach. In end-to-end speech recognition, linguistic knowledge such as a pronunciation lexicon is not essential, and as a result a CTC-based ASR system that omits such knowledge performs worse than the baseline. To address this problem, a strategy is proposed that combines existing linguistic knowledge with CTC-based acoustic modeling, taking the tri-phone as the basic unit of acoustic modeling. The sparsity problem of the modeling units is thereby effectively solved, and the discrimination and robustness of the CTC model are improved substantially. Results on the test set of a Tibetan corpus show that the word accuracy of the CTC-based model improves substantially, verifying the effectiveness of combining linguistic information with CTC modeling.
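As a rough illustration of the two ideas named in the abstract, the sketch below expands a word sequence into tri-phone labels through a pronunciation lexicon and applies the standard CTC collapsing rule (merge repeated labels, then drop blanks). It is a minimal sketch under stated assumptions, not the authors' implementation: the toy lexicon, the phone symbols, the `sil` boundary marker, the `<blk>` blank symbol, and all function names are hypothetical.

```python
from typing import Dict, List

BLANK = "<blk>"  # hypothetical symbol standing in for the CTC blank label


def to_triphones(phones: List[str], boundary: str = "sil") -> List[str]:
    """Expand a phone sequence into left-center+right tri-phone labels.

    The utterance is padded with an assumed boundary symbol on both sides
    so that the first and last phones also receive a context.
    """
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def words_to_triphones(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Map words to phones via a pronunciation lexicon, then to tri-phones."""
    phones = [p for w in words for p in lexicon[w]]
    return to_triphones(phones)


def ctc_collapse(frames: List[str], blank: str = BLANK) -> List[str]:
    """Apply the CTC collapsing rule to a frame-level label path:
    merge consecutive repeats, then remove blanks."""
    out: List[str] = []
    prev = None
    for label in frames:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


# Toy lexicon with made-up entries (not real Tibetan pronunciations).
toy_lexicon = {"w1": ["k", "a"], "w2": ["s", "o"]}
targets = words_to_triphones(["w1", "w2"], toy_lexicon)

# A hypothetical frame-level path that collapses to the first two targets.
path = [BLANK, targets[0], targets[0], BLANK, targets[1], targets[1]]
decoded = ctc_collapse(path)
```

In this sketch the lexicon supplies the linguistic knowledge, and the resulting tri-phone labels serve as the target units that a CTC model would be trained to emit; how the paper ties or clusters tri-phones is not specified in the abstract and is not modeled here.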
王庆楠, 郭武, 解传栋. 基于端到端技术的藏语语音识别[J]. 模式识别与人工智能, 2017, 30(4): 359-364.
WANG Qingnan, GUO Wu, XIE Chuandong. Towards End to End Speech Recognition System for Tibetan. Pattern Recognition and Artificial Intelligence, 2017, 30(4): 359-364.