Scanned English Document Retrieval Based on OCR and Word Shape Coding
XIA Yong1,2, DAI Ru-Wei2, XIAO Bai-Hua2, WANG Chun-Heng2
1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
2. Key Laboratory of Complex System and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080
Abstract: Two commonly used approaches to scanned document retrieval are analyzed: retrieval based on optical character recognition (OCR) and retrieval based on word shape coding. A new strategy is proposed that combines the two approaches according to recognition confidence. In addition, a new word shape coding scheme based on typographic features and strokes is presented; it is tolerant of font variation. Experiments using different word indexing methods verify the validity of the proposed method.
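The combination idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the shape code below is a simplified ascender/descender classification standing in for their typographic-feature-and-stroke coding, and the names `shape_code`, `WordRecord`, `matches`, the confidence scale of 0 to 1, and the threshold of 0.9 are all assumptions made for illustration. The sketch also assumes the shape code of each indexed word has already been extracted from its image.

```python
from dataclasses import dataclass

# Simplified character classes (an assumption, not the paper's coding scheme).
ASCENDERS = set("bdfhklt")   # lowercase letters rising above the x-height
DESCENDERS = set("gjpqy")    # lowercase letters dropping below the baseline


def shape_code(word: str) -> str:
    """Map each character of a text word to a coarse shape class."""
    classes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            classes.append("A")   # tall character
        elif ch in DESCENDERS:
            classes.append("D")   # character with a descender
        else:
            classes.append("x")   # x-height character
    return "".join(classes)


@dataclass
class WordRecord:
    ocr_text: str      # OCR output for one word image
    confidence: float  # recognition confidence, assumed to lie in [0, 1]
    image_shape: str   # shape code extracted from the image (assumed given)


def matches(query: str, word: WordRecord, threshold: float = 0.9) -> bool:
    """Trust the OCR text when confidence is high; otherwise compare shapes."""
    if word.confidence >= threshold:
        return word.ocr_text.lower() == query.lower()
    return word.image_shape == shape_code(query)


if __name__ == "__main__":
    # A misrecognized word with low confidence is still found by its shape.
    noisy = WordRecord(ocr_text="shnpe", confidence=0.4, image_shape="xAxDx")
    print(matches("shape", noisy))
```

In this sketch a low-confidence OCR error such as "shnpe" can still match the query "shape" through the shape code, while high-confidence words are matched on their recognized text. Shape codes are coarser than text, so the fallback may also return words that merely share the same outline.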
XIA Yong, DAI Ru-Wei, XIAO Bai-Hua, WANG Chun-Heng. Scanned English Document Retrieval Based on OCR and Word Shape Coding [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2009, 22(3): 488-493.