Scanned English Document Retrieval Based on OCR and Word Shape Coding<sup>*</sup>


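Abstract: This paper analyzes the two classes of methods commonly used for scanned document retrieval: those based on optical character recognition (OCR) and those based on word shape coding. It proposes combining the two methods according to recognition confidence. It also proposes a word shape coding method, built on the typographic characteristics and stroke features of the document, that is highly tolerant of font variation. Keyword retrieval experiments comparing the different indexing methods show that the proposed method achieves better performance.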


Keywords: word shape coding, optical character recognition (OCR), recognition confidence evaluation, document retrieval

Abstract: Two commonly used approaches to scanned document retrieval are analyzed: retrieval based on optical character recognition (OCR) and retrieval based on word shape coding. A strategy for combining the two approaches according to recognition confidence is proposed. Furthermore, a new word shape coding method based on typographic and stroke features is presented; it is tolerant of font variation. Keyword retrieval experiments with different word indexing methods verify the validity of the proposed method.

Key words: Word Shape Coding, Optical Character Recognition (OCR), Evaluation of Recognition Confidence, Document Retrieval
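
The confidence-based combination described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the three-class shape alphabet (ascender, descender, x-height), the `WordRecord` fields, the `threshold` value, and all function names are assumptions made for illustration. It also assumes that a shape code has already been extracted from each word image. Words recognized with high confidence are indexed by their OCR text; the rest are indexed by shape code, and a keyword query consults both indexes.

```python
from dataclasses import dataclass

# Hypothetical coarse shape alphabet; the paper's own coding scheme
# (typographic and stroke features) is not reproduced here.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word: str) -> str:
    """Map each character of a text keyword to a coarse shape class."""
    classes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            classes.append("A")  # rises above the x-height
        elif ch in DESCENDERS:
            classes.append("D")  # drops below the baseline
        else:
            classes.append("x")  # stays within the x-height band
    return "".join(classes)


@dataclass
class WordRecord:
    doc_id: str
    ocr_text: str      # OCR output for the word image
    confidence: float  # assumed recognition confidence in [0, 1]
    image_shape: str   # shape code assumed to be extracted from the image


def build_index(records, threshold=0.8):
    """Index confident words by OCR text and the rest by shape code."""
    text_index, shape_index = {}, {}
    for rec in records:
        if rec.confidence >= threshold:
            text_index.setdefault(rec.ocr_text.lower(), set()).add(rec.doc_id)
        else:
            shape_index.setdefault(rec.image_shape, set()).add(rec.doc_id)
    return text_index, shape_index


def search(keyword, text_index, shape_index):
    """Return documents matching the keyword in either index."""
    hits = set(text_index.get(keyword.lower(), set()))
    hits |= shape_index.get(shape_code(keyword), set())
    return hits


if __name__ == "__main__":
    records = [
        WordRecord("doc1", "retrieval", 0.95, "xxAxxxxxA"),
        WordRecord("doc2", "retr1eva1", 0.40, "xxAxxxxxA"),  # OCR errors
        WordRecord("doc3", "coding", 0.90, "xxAxxD"),
    ]
    text_index, shape_index = build_index(records)
    print(sorted(search("retrieval", text_index, shape_index)))
```

In this sketch the misrecognized word in `doc2` is still retrievable because its shape code matches that of the query keyword, which is the motivation for falling back to shape coding when recognition confidence is low.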

Received: 2008-06-30

CLC number: TP391

Funding: Supported by the National Natural Science Foundation of China (No.60602031)

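About the authors:
XIA Yong (夏勇), male, born in 1975, Ph.D. His research interests include pattern recognition, image processing, and information retrieval. E-mail: xiayong@hit.edu.cn.
DAI Ru-Wei (戴汝为), male, born in 1932, research professor and academician. His research interests include pattern recognition, metasynthesis theory, and complex systems.
XIAO Bai-Hua (肖柏华), male, born in 1974, research professor. His research interests include pattern recognition, image processing, and information retrieval.
WANG Chun-Heng (王春恒), male, born in 1971, research professor. His research interests include pattern recognition, metasynthesis theory, and complex systems.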

Cite this article:

XIA Yong, DAI Ru-Wei, XIAO Bai-Hua, WANG Chun-Heng. Scanned English Document Retrieval Based on OCR and Word Shape Coding[J]. 模式识别与人工智能 (Pattern Recognition and Artificial Intelligence), 2009, 22(3): 488-493.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2009/V22/I3/488
