Abstract: Among streaming recognition methods, chunk-based recognition breaks parallelism and is resource-intensive, while restricting the context of the self-attention mechanism makes it difficult to exploit the full context. To address this, a lightweight chunk-based end-to-end acoustic architecture, CFLASH-Transducer, is proposed. To capture fine-grained local features, the lightweight FLASH (Fast Linear Attention with a Single Head) is combined with convolutional neural network blocks. The convolutional block adopts the Inception V2 network to extract multi-scale local features of the speech signal, and a Coordinate Attention mechanism then captures the positional information of these features and the correlations across channels. In addition, depthwise separable convolutions are used for feature enhancement and for smooth transitions between layers. To support streaming audio processing, the model is trained and decoded with the RNN-T (Recurrent Neural Network Transducer) architecture. The global attention already computed for the current chunk is passed to subsequent chunks as a hidden state, chaining the chunks together, preserving the parallelism and cross-chunk dependencies of training, and keeping the computational cost from growing with sequence length. Trained and tested on the open-source THCHS30 dataset, CFLASH-Transducer achieves a high recognition rate, and the accuracy loss of streaming recognition relative to offline recognition is below 1%.
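To make the chunk-chaining mechanism concrete, the following is a minimal sketch (in PyTorch, with illustrative tensor shapes assumed here rather than taken from the paper) of chunk-wise linear attention with a carried state, in the spirit of the FLASH-style global-attention branch the abstract describes: each chunk attends to all earlier chunks through an accumulated k^T v state, so per-chunk cost stays constant as the sequence grows.

import torch

def chunked_global_attention(q, k, v, chunk_size):
    # A sketch, not the authors' implementation: the local quadratic branch,
    # normalization and gating of FLASH are omitted for brevity.
    # q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by chunk_size.
    b, n, d = q.shape
    out = torch.zeros_like(q)
    state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)  # carried sum of k^T v
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        qc, kc, vc = q[:, start:end], k[:, start:end], v[:, start:end]
        # each chunk attends to all earlier chunks through the carried state,
        # so per-chunk compute and memory do not grow with sequence length
        out[:, start:end] = qc @ state
        # fold the current chunk into the state for the chunks that follow
        state = state + kc.transpose(1, 2) @ vc
    return out

# usage: q = k = v = torch.randn(2, 64, 32); y = chunked_global_attention(q, k, v, 16)

Within each chunk the computation remains a parallel matrix product, which is why training parallelism is preserved, while streaming inference only needs to keep the fixed-size carried state between chunks.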