Lightweight End-to-End Architecture for Streaming Speech Recognition
YANG Shuying¹, LI Xin¹
1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384
Abstract In streaming recognition methods, chunk-based recognition breaks parallelism and consumes more computing resources, while contextual recognition with a restricted self-attention mechanism struggles to capture global information. Therefore, a lightweight chunk-based end-to-end acoustic recognition method, CFLASH-Transducer, is proposed. It combines fast linear attention with a single head (FLASH) and convolutional neural networks (CNNs) to capture fine-grained local features. An Inception V2 network is introduced into the convolutional block to extract multi-scale local features of the speech signal. A coordinate attention mechanism is adopted to capture the positional information of features and the interconnections among channels. Depthwise separable convolution is utilized for feature enhancement and smooth transitions between layers. The recurrent neural network transducer (RNN-T) architecture is employed for training and decoding. The global attention computed in the current chunk is passed to subsequent chunks as a hidden state, connecting the information of the chunks, preserving training parallelism and inter-chunk correlation, and avoiding computational cost that grows with sequence length. CFLASH-Transducer achieves high recognition accuracy on the open-source THCHS-30 dataset, with a loss of less than 1% in streaming recognition accuracy compared with offline recognition.
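To make the chunk mechanism described in the abstract concrete, below is a minimal sketch of chunk-wise attention with a carried cross-chunk state (the function name, the chunk size, and the use of softmax for the within-chunk term are assumptions for illustration; FLASH itself uses a gated attention unit with squared-ReLU weights and normalization, which this sketch omits). Each chunk attends quadratically to itself, while information from earlier chunks enters only through a running sum of kᵀv, so per-chunk work is constant and total cost stays linear in sequence length.

```python
import torch


def chunked_linear_attention(q, k, v, chunk_size=64):
    """Chunk-wise attention with a carried cross-chunk state (illustrative only).

    q, k, v: (batch, time, dim); time is assumed to be a multiple of chunk_size.
    """
    b, t, d = q.shape
    state = q.new_zeros(b, d, d)  # cumulative k^T v carried across chunks
    outputs = []
    for s in range(0, t, chunk_size):
        qc = q[:, s:s + chunk_size]
        kc = k[:, s:s + chunk_size]
        vc = v[:, s:s + chunk_size]
        # local term: ordinary attention within the current chunk
        local = torch.softmax(qc @ kc.transpose(1, 2) / d ** 0.5, dim=-1) @ vc
        # global term: attend to the state carried over from earlier chunks
        outputs.append(local + qc @ state)
        state = state + kc.transpose(1, 2) @ vc  # update the carried state
    return torch.cat(outputs, dim=1)
```

Likewise, the coordinate attention step can be sketched as follows, after Hou et al. (2021), treating the acoustic feature map as batch × channels × time × frequency (the class name and the reduction ratio are assumptions, not the paper's settings):

```python
import torch
from torch import nn


class CoordinateAttention(nn.Module):
    """Coordinate attention over a (batch, channels, time, freq) feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_t = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_f = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, t, f = x.shape
        pool_t = x.mean(dim=3, keepdim=True)                  # (b, c, t, 1)
        pool_f = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (b, c, f, 1)
        y = self.shared(torch.cat([pool_t, pool_f], dim=2))   # (b, mid, t+f, 1)
        y_t, y_f = y.split([t, f], dim=2)
        attn_t = torch.sigmoid(self.conv_t(y_t))                  # (b, c, t, 1)
        attn_f = torch.sigmoid(self.conv_f(y_f.transpose(2, 3)))  # (b, c, 1, f)
        return x * attn_t * attn_f  # reweight by time and frequency positions
```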
Received: 15 November 2022
Fund: Tianjin Virtual Simulation Experimental Teaching Project (No. Jin Education and Government Office [2019] No. 69), Key Teaching Fund Project of Tianjin University of Technology (No. ZD20-04)
Corresponding author:
YANG Shuying, Ph.D., professor. Her research interests include pattern recognition, time series analysis and speech recognition.
About author: LI Xin, master's student. His research interests include deep learning, speech recognition and natural language processing.