Towards End to End Speech Recognition System for Tibetan<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201704008


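Abstract: End-to-end large-scale continuous speech recognition based on connectionist temporal classification (CTC) is currently an active research topic. This paper applies it to Tibetan speech recognition and obtains better performance than the mainstream bidirectional long short-term memory network. End-to-end speech recognition does not require linguistic knowledge such as a pronunciation lexicon, so recognition performance cannot be guaranteed. This paper proposes incorporating existing linguistic knowledge into end-to-end acoustic modeling: tied triphones are used as the modeling units, which addresses the sparsity of the modeling units and substantially improves the discrimination and robustness of the acoustic model. Experiments on a Tibetan test set show that the proposed method improves the recognition rate of the CTC-based acoustic model, and they verify the effectiveness of combining linguistic knowledge with end-to-end acoustic modeling.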

Authors: WANG Qingnan, GUO Wu, XIE Chuandong

Keywords: end-to-end, Tibetan, automatic speech recognition, connectionist temporal classification

Abstract: End-to-end speech recognition based on connectionist temporal classification (CTC) is applied to Tibetan automatic speech recognition (ASR), and its performance is better than that of the state-of-the-art bidirectional long short-term memory approach. In end-to-end speech recognition, linguistic knowledge such as a pronunciation lexicon is not required, and therefore the recognition performance cannot be guaranteed. To address this problem, a strategy combining existing linguistic knowledge with CTC-based acoustic modeling is proposed, in which tied triphones are taken as the basic units in acoustic modeling. The sparsity problem of the modeling units is thus effectively addressed, and the discrimination and robustness of the CTC model are improved substantially. Results on the test set of a Tibetan corpus show that the word accuracy of the CTC-based model is improved substantially, and the effectiveness of combining linguistic information with CTC modeling is verified.
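
As a rough illustration of how a CTC model's frame-level output becomes a unit sequence, the sketch below applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) to a made-up label path. The blank symbol name, the function name `ctc_collapse`, and the triphone labels are all hypothetical and are not taken from the paper; this shows only the generic CTC mapping, not the authors' decoder.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol (hypothetical)


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label path to an output sequence.

    Standard CTC many-to-one mapping: first merge runs of identical
    consecutive labels, then remove every blank symbol.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical best label per frame. "l-a+s" stands for the phone "a"
# with left context "l" and right context "s" (a triphone unit).
path = [BLANK, "sil-l+a", "sil-l+a", BLANK, "l-a+s", "l-a+s",
        BLANK, BLANK, "a-s+sil", BLANK]

decoded = ctc_collapse(path)
print(decoded)  # ['sil-l+a', 'l-a+s', 'a-s+sil']
```

A blank between two identical labels keeps them distinct, so `ctc_collapse(["a", BLANK, "a"])` yields two units rather than one.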

Key words: End to End, Tibetan, Automatic Speech Recognition, Connectionist Temporal Classification

Received: 2016-09-30

Funding: Supported by the National Key Research and Development Program of China (No. 2016YFB1001300)

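About the authors: WANG Qingnan (corresponding author), male, born in 1992, master's student. His main research interest is speech recognition. E-mail: wqn628@mail.ustc.edu.cn.
GUO Wu, male, born in 1973, Ph.D., associate professor. His main research interests are speech recognition and speaker recognition. E-mail: guowu@ustc.edu.cn.
XIE Chuandong, male, born in 1990, master's student. His main research interests are speech recognition and keyword search. E-mail: xcdahu@mail.ustc.edu.cn.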

Cite this article:

WANG Qingnan, GUO Wu, XIE Chuandong. Towards End to End Speech Recognition System for Tibetan[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2017, 30(4): 359-364.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201704008 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2017/V30/I4/359
