Abstract: Among streaming recognition methods, chunk-based recognition breaks parallelism and is resource-intensive, while restricting the context of the self-attention mechanism makes it difficult to exploit the full context. To address this, a lightweight chunk-based end-to-end acoustic architecture, CFLASH-Transducer, is proposed. To capture fine-grained local features, the lightweight FLASH (Fast Linear Attention with a Single Head) is combined with convolutional neural network blocks. The convolutional block adopts the Inception V2 network to extract multi-scale local features of the speech signal, and a Coordinate Attention mechanism then captures the positional information of these features and the correlations across channels. In addition, depthwise separable convolutions are used for feature enhancement and for smooth transitions between layers. To support streaming audio processing, the model is trained and decoded with the RNN-T (Recurrent Neural Network Transducer) architecture. The global attention already computed for the current chunk is passed to subsequent chunks as a hidden state, chaining the chunks together, preserving the parallelism and cross-chunk dependencies of training, and keeping the computational cost from growing with sequence length. Trained and tested on the open-source THCHS30 dataset, CFLASH-Transducer achieves a high recognition rate, and the accuracy loss of streaming recognition relative to offline recognition is below 1%.
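To make the chunk-chaining mechanism concrete, the following is a minimal sketch (in PyTorch, with illustrative tensor shapes assumed here rather than taken from the paper) of chunk-wise linear attention with a carried state, in the spirit of the FLASH-style global-attention branch the abstract describes: each chunk attends to all earlier chunks through an accumulated k^T v state, so per-chunk cost stays constant as the sequence grows.

import torch

def chunked_global_attention(q, k, v, chunk_size):
    # A sketch, not the authors' implementation: the local quadratic branch,
    # normalization and gating of FLASH are omitted for brevity.
    # q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by chunk_size.
    b, n, d = q.shape
    out = torch.zeros_like(q)
    state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)  # carried sum of k^T v
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        qc, kc, vc = q[:, start:end], k[:, start:end], v[:, start:end]
        # each chunk attends to all earlier chunks through the carried state,
        # so per-chunk compute and memory do not grow with sequence length
        out[:, start:end] = qc @ state
        # fold the current chunk into the state for the chunks that follow
        state = state + kc.transpose(1, 2) @ vc
    return out

# usage: q = k = v = torch.randn(2, 64, 32); y = chunked_global_attention(q, k, v, 16)

Within each chunk the computation remains a parallel matrix product, which is why training parallelism is preserved, while streaming inference only needs to keep the fixed-size carried state between chunks.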