Pattern Recognition and Artificial Intelligence
Pattern Recognition and Artificial Intelligence  2023, Vol. 36 Issue (3): 268-279    DOI: 10.16451/j.cnki.issn1003-6059.202303006
Researches and Applications
Lightweight End-to-End Architecture for Streaming Speech Recognition
YANG Shuying1, LI Xin1
1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384

Abstract  In streaming speech recognition, chunk-based recognition breaks parallelism and consumes more resources, while context-based recognition with a restricted self-attention mechanism struggles to capture all of the available information. Therefore, a lightweight, chunk-based end-to-end acoustic recognition method, CFLASH-Transducer, is proposed. It combines fast linear attention with a single head (FLASH) and convolutional neural networks (CNNs) to obtain fine-grained local features. The Inception V2 network is introduced into the convolutional block to extract multi-scale local features of the speech signal. A coordinate attention mechanism is adopted to capture the positional information of the features and the interconnections among multiple channels. Depthwise separable convolution is used for feature enhancement and for smooth transitions between layers. The recurrent neural network transducer (RNN-T) architecture is employed for training and decoding to process audio. The global attention computed in the current chunk is passed to subsequent chunks as a hidden variable, which connects the information of each chunk, retains training parallelism and inter-chunk correlation, and avoids increasing consumption of computing resources as the sequence grows. CFLASH-Transducer achieves high recognition accuracy on the open-source dataset THCHS30, with a loss in streaming recognition accuracy of less than 1% compared with offline recognition.
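The idea of passing global attention from one chunk (block) to the next as a hidden variable can be illustrated with a minimal sketch. This is not the authors' CFLASH-Transducer implementation: it assumes plain, unnormalized causal linear attention and omits FLASH's gating, its local quadratic attention, the convolutional front end, and the RNN-T loss. The function names `chunked_linear_attention` and `reference_attention` and the variable `state` are hypothetical, chosen for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def chunked_linear_attention(q, k, v, chunk_size):
    """Causal, unnormalized linear attention computed chunk by chunk.

    q and k are lists of T rows of length d; v is a list of T rows of
    length e. The carried `state` (a d x e matrix) has a fixed size, so
    the cost per chunk does not grow with the sequence length.
    """
    d, e = len(k[0]), len(v[0])
    state = [[0.0] * e for _ in range(d)]  # sum of K^T V over past chunks
    out = []
    for start in range(0, len(q), chunk_size):
        qc = q[start:start + chunk_size]
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        # Contribution of all earlier chunks, read from the carried state.
        global_part = matmul(qc, state)
        # Contribution of the current chunk, masked so that position i
        # only attends to positions j <= i inside the chunk.
        scores = matmul(qc, transpose(kc))
        for i, row in enumerate(scores):
            for j in range(i + 1, len(row)):
                row[j] = 0.0
        local_part = matmul(scores, vc)
        for g_row, l_row in zip(global_part, local_part):
            out.append([g + l for g, l in zip(g_row, l_row)])
        # Fold the current chunk into the state for later chunks.
        update = matmul(transpose(kc), vc)
        state = [[s + u for s, u in zip(s_row, u_row)]
                 for s_row, u_row in zip(state, update)]
    return out


def reference_attention(q, k, v):
    """Token-by-token reference: o_t = sum over s <= t of (q_t . k_s) v_s."""
    out = []
    for t in range(len(q)):
        row = [0.0] * len(v[0])
        for s in range(t + 1):
            weight = sum(a * b for a, b in zip(q[t], k[s]))
            row = [r + weight * x for r, x in zip(row, v[s])]
        out.append(row)
    return out
```

Under these assumptions, the chunked computation yields the same outputs as the token-by-token reference for any chunk size, while each chunk can still be processed with matrix operations. That is the property the abstract attributes to carrying the global attention forward as a hidden variable.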
Key words: Automatic Speech Recognition; Streaming Recognition; FLASH (Fast Linear Attention with a Single Head); CNN (Convolutional Neural Network); RNN-T (Recurrent Neural Network Transducer)
Received: 15 November 2022     
ZTFLH: TN912.34; TP391.4
Fund: Tianjin Virtual Simulation Experimental Teaching Project (No. Jin Education and Government Office [2019] No. 69), Key Teaching Fund Project of Tianjin University of Technology (No. ZD20-04)
Corresponding Authors: YANG Shuying, Ph.D., professor. Her research interests include pattern recognition, time series and speech recognition.   
About author: LI Xin, master student. His research interests include deep learning, speech recognition and natural language processing.
Cite this article:   
YANG Shuying, LI Xin. Lightweight End-to-End Architecture for Streaming Speech Recognition[J]. Pattern Recognition and Artificial Intelligence, 2023, 36(3): 268-279.
URL:  
http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202303006      OR     http://manu46.magtech.com.cn/Jweb_prai/EN/Y2023/V36/I3/268