Abstract: Conventional speech features cannot effectively exploit long temporal context or supervised information, and their training time cost is high. To overcome these drawbacks, a bottleneck feature extraction method based on a hierarchical deep sparse belief network is presented. The overlapping group lasso is used as the sparse regularization constraint in the objective function of the deep belief network, yielding a faster deep sparse belief network. To make full use of the hierarchical structure, two sparse deep belief networks are connected in series to enhance the discriminative ability of the bottleneck features. Experimental results on a phoneme recognition task show that the proposed features are effective.
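The overlapping group lasso regularizer mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how groups are formed or which quantity is penalized, so the sliding-window grouping, the choice of a hidden-activation vector as input, and the function names below are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def overlapping_group_lasso(h, groups, weights=None):
    """Sum of (weighted) L2 norms over index groups that may overlap."""
    h = np.asarray(h, dtype=float)
    if weights is None:
        weights = np.ones(len(groups))
    return float(sum(w * np.linalg.norm(h[list(g)])
                     for w, g in zip(weights, groups)))

def sliding_groups(n_units, size, stride):
    """Hypothetical grouping: windows of `size` units shifted by `stride`.

    A stride smaller than `size` makes neighbouring groups overlap.
    """
    return [range(s, s + size) for s in range(0, n_units - size + 1, stride)]

# Assumed example: four hidden-unit activations, groups of two with overlap.
h = np.array([3.0, 4.0, 0.0, 0.0])
groups = sliding_groups(4, size=2, stride=1)   # {0,1}, {1,2}, {2,3}
penalty = overlapping_group_lasso(h, groups)   # 5 + 4 + 0
```

In a training setup, a term of this kind would be scaled by a regularization coefficient and added to the network's objective; groups whose units are all zero contribute nothing, which is what encourages group-level sparsity.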
王一,杨俊安,刘辉,柳林. 基于层次稀疏DBN的瓶颈特征提取方法[J]. 模式识别与人工智能, 2015, 28(2): 173-180.
WANG Yi, YANG Jun-An, LIU Hui, LIU Lin. Bottleneck Feature Extraction Method Based on Hierarchical Deep Sparse Belief Network. Pattern Recognition and Artificial Intelligence, 2015, 28(2): 173-180.