Pattern Recognition and Artificial Intelligence, 2011, Vol. 24, Issue 4: 561-566
Query Expansion Based High Performance Chinese Voice Retrieval
LI Wei, WU Ji, LÜ Ping
Department of Electronic Engineering, Tsinghua University, Beijing 100084

Abstract: Chinese voice retrieval systems aim to locate user queries in Chinese audio documents quickly and accurately. A typical implementation runs speech recognition on the audio documents and indexes the output, then segments each query into a word sequence and searches the index with that sequence. Mismatches between the query segmentation and the recognition output degrade system performance. To address this problem, multiple segmentations are generated for each query, and each is expanded with prefixes and suffixes before retrieval. Because the expansion produces many candidates and slows retrieval, a finite state automaton (FSA) is introduced to compress the expansions, and a token-based search algorithm is designed on top of it for efficient retrieval. Experiments show that multiple segmentation combined with prefix-suffix expansion improves the retrieval equal error rate (EER) by a relative 50%~70%, and that the FSA compresses the retrieval space, making retrieval nearly 30 times faster.
Keywords: Chinese Speech Retrieval; Word Segmentation; Query Expansion; Finite State Automata (FSA); Token-based Search
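
The pipeline summarized in the abstract (several segmentations per query, compressed into an automaton, then searched with tokens) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the function names `segmentations`, `build_fsa`, and `token_search`, and the use of a plain prefix-sharing trie as the FSA are all assumptions made for illustration, and the prefix-suffix expansion step is omitted because the abstract does not define it precisely.

```python
from typing import Dict, List, Set, Tuple


def segmentations(query: str, lexicon: Set[str]) -> List[Tuple[str, ...]]:
    """Enumerate every way to split `query` into words of a (toy) lexicon."""
    if not query:
        return [()]
    results: List[Tuple[str, ...]] = []
    for end in range(1, len(query) + 1):
        head = query[:end]
        if head in lexicon:
            for rest in segmentations(query[end:], lexicon):
                results.append((head,) + rest)
    return results


def build_fsa(paths: List[Tuple[str, ...]]) -> Tuple[List[Dict[str, int]], Set[int]]:
    """Merge word sequences into a trie, a simple non-minimized deterministic
    acceptor in which sequences with a common prefix share states."""
    transitions: List[Dict[str, int]] = [{}]  # state 0 is the start state
    finals: Set[int] = set()
    for path in paths:
        state = 0
        for word in path:
            if word not in transitions[state]:
                transitions.append({})
                transitions[state][word] = len(transitions) - 1
            state = transitions[state][word]
        finals.add(state)
    return transitions, finals


def token_search(words: List[str],
                 transitions: List[Dict[str, int]],
                 finals: Set[int]) -> List[Tuple[int, int]]:
    """Pass tokens over a recognized word sequence in a single scan and
    return the (start, end) word spans accepted by the automaton."""
    hits: List[Tuple[int, int]] = []
    tokens: List[Tuple[int, int]] = []  # each token is (state, start index)
    for i, word in enumerate(words):
        tokens.append((0, i))  # a new token may start at every position
        survivors: List[Tuple[int, int]] = []
        for state, start in tokens:
            nxt = transitions[state].get(word)
            if nxt is None:
                continue  # token dies: no transition on this word
            if nxt in finals:
                hits.append((start, i + 1))
            if transitions[nxt]:
                survivors.append((nxt, start))
        tokens = survivors
    return hits


# Hypothetical toy data: a lexicon and two recognized transcripts.
lexicon = {"清", "华", "清华", "大学", "清华大学"}
paths = segmentations("清华大学", lexicon)
transitions, finals = build_fsa(paths)
doc_a = ["我", "在", "清华", "大学", "读书"]
doc_b = ["清华大学", "很", "大"]
hits_a = token_search(doc_a, transitions, finals)
hits_b = token_search(doc_b, transitions, finals)
```

With a single automaton, each transcript is scanned once regardless of how many segmentations the query has, which is the intuition behind compressing the expansions before searching.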
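Received: 2010-09-25
CLC number: TP319.3
About the authors: LI Wei, male, born in 1981, Ph.D. His main research interest is content-oriented voice search. E-mail: w-li-06@mails.tsinghua.edu.cn. WU Ji, male, born in 1973, associate professor. His main research interests are speech recognition and multimedia information processing. LÜ Ping, female, born in 1974, associate researcher. Her main research interest is speech signal processing.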
Cite this article: LI Wei, WU Ji, LÜ Ping. Query Expansion Based High Performance Chinese Voice Retrieval [J]. Pattern Recognition and Artificial Intelligence, 2011, 24(4): 561-566.
