Semi-supervised Acoustic Modeling Based on Perplexity Data Selection
XIE Chuandong, GUO Wu
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027
Abstract For acoustic modeling of low-resource languages, a perplexity-based approach is proposed to select unsupervised data from decoding transcriptions and retrain the acoustic model. A large unsupervised corpus is decoded with an initial acoustic model trained on a small amount of labeled data, and the perplexity of each decoded transcription is computed against the training set. The selected data, those most similar to the labeled data, are then combined with the labeled data to train the acoustic model. To reduce the impact of errors in the decoded unsupervised data, the final network parameters of the deep-neural-network acoustic model are adjusted in the last training iteration using only the correctly labeled data. On the Swahili VLLP recognition task of the NIST 2015 open keyword search evaluation, the proposed approach improves the recognition rate compared with other methods.
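The selection step described above can be illustrated with a minimal Python sketch. It assumes a simple add-one-smoothed bigram language model estimated from the labeled transcripts; the paper's actual language model and the perplexity threshold are not specified here, so the function names and threshold value are illustrative only.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate an add-one-smoothed bigram LM from labeled transcripts."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)  # vocabulary size for add-one smoothing
    return unigrams, bigrams, vocab

def perplexity(sentence, lm):
    """Per-token perplexity of one decoded hypothesis under the LM."""
    unigrams, bigrams, vocab = lm
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothing handles bigrams unseen in the labeled data.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        logp += math.log(p)
    n = len(toks) - 1
    return math.exp(-logp / n)

def select_utterances(decoded, lm, threshold):
    """Keep decoded utterances whose perplexity against the
    labeled-data LM is below the threshold, i.e. utterances
    most similar to the labeled training set."""
    return [u for u in decoded if perplexity(u, lm) <= threshold]
```

Selected utterances would then be pooled with the labeled data for retraining, with the final DNN parameters fine-tuned on the labeled data alone in the last iteration.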
Received: 21 October 2015
Corresponding Author:
GUO Wu, born in 1973, Ph.D., associate professor. His research interests include speech recognition and speaker recognition.
About the author: XIE Chuandong, born in 1990, master's student. His research interests include speech recognition and keyword retrieval.