Pattern Recognition and Artificial Intelligence
Pattern Recognition and Artificial Intelligence, 2023, Vol. 36, Issue (3): 268-279    DOI: 10.16451/j.cnki.issn1003-6059.202303006
Researches and Applications
Lightweight End-to-End Architecture for Streaming Speech Recognition
YANG Shuying1, LI Xin1
1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384

Abstract: In streaming recognition methods, chunk-based recognition breaks parallelism and consumes considerable resources, while restricting the context of the self-attention mechanism makes it difficult to obtain all the information. Therefore, a lightweight end-to-end acoustic architecture, CFLASH-Transducer, is proposed. To capture fine-grained local features, the lightweight FLASH (fast linear attention with a single head) is combined with convolutional neural network blocks. An Inception V2 network is adopted in the convolutional block to extract multi-scale local features of the speech signal, and a coordinate attention mechanism then captures the positional information of the features and the correlations among channels. In addition, depthwise separable convolution is employed for feature enhancement and smooth transitions between layers. To process audio in a streaming manner, the RNN-T (recurrent neural network transducer) architecture is adopted for training and decoding. The global attention already computed in the current chunk is passed to subsequent chunks as a hidden variable, linking the information of all chunks, preserving training parallelism and cross-chunk correlation, and keeping the computational cost from growing with sequence length. Trained and tested on the open-source THCHS30 dataset, CFLASH-Transducer achieves a high recognition rate, and the accuracy loss of streaming recognition is less than 1% compared with offline recognition.
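As a concrete illustration of the convolution module described in the abstract (Inception-V2-style multi-scale branches, coordinate attention, and a depthwise separable convolution), the following is a minimal PyTorch sketch. The class names, kernel sizes, channel counts and residual wiring are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch, assuming 2-D conv over (time, frequency) log-mel features.
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Coordinate attention: directional pooling over the time and frequency axes."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_t = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_f = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                                      # x: (B, C, T, F)
        _, _, t, f = x.shape
        pool_t = x.mean(dim=3, keepdim=True)                   # (B, C, T, 1)
        pool_f = x.mean(dim=2, keepdim=True).transpose(2, 3)   # (B, C, F, 1)
        y = self.act(self.bn(self.conv1(torch.cat([pool_t, pool_f], dim=2))))
        y_t, y_f = torch.split(y, [t, f], dim=2)
        attn_t = torch.sigmoid(self.conv_t(y_t))               # (B, C, T, 1)
        attn_f = torch.sigmoid(self.conv_f(y_f.transpose(2, 3)))  # (B, C, 1, F)
        return x * attn_t * attn_f


class MultiScaleConvBlock(nn.Module):
    """Inception-V2-style branches + coordinate attention + depthwise separable conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                          # 5x5 receptive field as two 3x3 convs
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        self.coord_attn = CoordinateAttention(out_ch)
        self.dw = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch)
        self.pw = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                                      # x: (B, in_ch, T, F)
        y = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        y = self.coord_attn(self.act(self.fuse(y)))
        return self.act(self.pw(self.dw(y))) + y               # residual over attended features


feats = torch.randn(2, 1, 200, 80)                             # (batch, channel, frames, mel bins)
print(MultiScaleConvBlock(1, 64)(feats).shape)                 # torch.Size([2, 64, 200, 80])
```

The 5x5 branch is factorised into two 3x3 convolutions in the Inception V2 spirit, and the depthwise separable convolution keeps the parameter count small, in line with the lightweight goal stated in the abstract.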
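The streaming idea highlighted in the abstract, passing the global attention already computed for the current chunk to later chunks as a hidden variable, can be illustrated with a FLASH-style chunked linear-attention loop. The sketch below is an assumption-laden illustration: the function name `stream_chunks`, the ReLU feature map, the chunk size, and the combination of local softmax attention with a carried linear-attention state are not the paper's exact formulation.

```python
# Sketch of chunk-wise attention with a carried hidden state (not the paper's exact equations).
import torch


def stream_chunks(q, k, v, chunk_size=64):
    """q, k, v: (T, D) projections of one utterance; returns (T, D_v) outputs chunk by chunk."""
    phi = torch.relu                       # non-negative feature map for the linear (global) part
    d_k, d_v = k.size(-1), v.size(-1)
    state = torch.zeros(d_k, d_v)          # running sum of phi(k)^T v over all processed chunks
    norm = torch.zeros(d_k)                # running sum of phi(k), used as a normaliser
    outputs = []
    for start in range(0, q.size(0), chunk_size):
        q_raw = q[start:start + chunk_size]
        k_raw = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        qc, kc = phi(q_raw), phi(k_raw)
        # Contribution of previous chunks, recovered from the carried hidden state.
        global_out = (qc @ state) / (qc @ norm + 1e-6).unsqueeze(-1)
        # Ordinary quadratic attention restricted to the current chunk (cheap: the chunk is short).
        local_out = torch.softmax(q_raw @ k_raw.t() / d_k ** 0.5, dim=-1) @ vc
        outputs.append(local_out + global_out)
        # Fold the current chunk into the hidden state before moving to the next one.
        state = state + kc.t() @ vc
        norm = norm + kc.sum(dim=0)
    return torch.cat(outputs, dim=0)


# Toy usage with random projections (400 frames, 80-dim queries/keys/values).
q, k, v = (torch.randn(400, 80) for _ in range(3))
out = stream_chunks(q, k, v, chunk_size=64)    # -> (400, 80)
```

Because the carried state has a fixed size of d_k x d_v, the per-chunk cost does not grow with the amount of audio already processed, which is the property the abstract attributes to CFLASH-Transducer.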
Key words: Automatic Speech Recognition; Streaming Recognition; Fast Linear Attention with a Single Head (FLASH); Convolutional Neural Network (CNN); Recurrent Neural Network Transducer (RNN-T)
Received: 2022-11-15
Chinese Library Classification: TN912.34; TP391.4
Fund: Supported by Tianjin Virtual Simulation Experimental Teaching Project (No. 津教政办[2019]69号) and University-Level Key Teaching Fund Project of Tianjin University of Technology (No. ZD20-04)
Corresponding author: YANG Shuying, Ph.D., professor. Main research interests include pattern recognition, time series analysis and speech recognition. E-mail: yangshuying@email.tjut.edu.cn.
About the author: LI Xin, master's student. Main research interests include deep learning, speech recognition and natural language processing. E-mail: lixin9595@outlook.com.
Cite this article:
YANG Shuying, LI Xin. Lightweight End-to-End Architecture for Streaming Speech Recognition[J]. Pattern Recognition and Artificial Intelligence, 2023, 36(3): 268-279.
Link to this article:
http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202303006      or      http://manu46.magtech.com.cn/Jweb_prai/CN/Y2023/V36/I3/268