Scanned English Document Retrieval Based on OCR and Word Shape Coding

XIA Yong (1,2), DAI Ru-Wei (2), XIAO Bai-Hua (2), WANG Chun-Heng (2)

1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
2. Key Laboratory of Complex System and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080

Abstract: Two commonly used methods for scanned document retrieval are analyzed: retrieval based on optical character recognition (OCR) and retrieval based on word shape coding. A new strategy is proposed that combines the two methods according to recognition confidence. In addition, a new word shape coding scheme based on typographic features and strokes is presented; it is tolerant of font variation. Experiments conducted with different word indexing methods verify the validity of the proposed method.
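
As a rough illustration of the confidence-based combination described in the abstract, the following Python sketch indexes a word by its OCR text when recognition confidence is high and by a shape code otherwise. The three shape classes (ascender, descender, x-height), the 0.8 threshold, and the names `shape_code`, `build_index`, and `search` are assumptions made for illustration only. The coding scheme actually proposed uses typographic features and strokes whose details are not given in the abstract.

```python
from collections import defaultdict

# Hypothetical shape classes (assumption): the abstract does not specify
# the actual typographic and stroke features used for coding.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map each character to a coarse shape class: A, D, or x."""
    code = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            code.append("A")  # rises above the x-height
        elif ch in DESCENDERS:
            code.append("D")  # drops below the baseline
        else:
            code.append("x")  # stays within the x-height band
    return "".join(code)


def build_index(words, threshold=0.8):
    """Index each word by OCR text or by shape code, depending on confidence.

    `words` holds (doc_id, ocr_text, confidence, image_shape) tuples, where
    image_shape is a shape code assumed to be extracted from the word image.
    """
    text_index = defaultdict(set)
    shape_index = defaultdict(set)
    for doc_id, ocr_text, confidence, image_shape in words:
        if confidence >= threshold:
            text_index[ocr_text.lower()].add(doc_id)  # trust the OCR result
        else:
            shape_index[image_shape].add(doc_id)  # fall back to word shape
    return text_index, shape_index


def search(query, text_index, shape_index):
    """Return documents matching the query in either index."""
    hits = set(text_index.get(query.lower(), set()))
    hits |= shape_index.get(shape_code(query), set())
    return hits


words = [
    ("doc1", "retrieval", 0.95, "xxAxxxxxA"),
    ("doc2", "retrievai", 0.40, "xxAxxxxxA"),  # low-confidence OCR error
    ("doc3", "shape", 0.90, "xAxDx"),
]
text_index, shape_index = build_index(words)
result = search("retrieval", text_index, shape_index)
```

In this toy example `doc2` is still found for the query "retrieval" even though its OCR text is wrong, because its shape code matches. A coarse code like this one also matches unrelated words with the same outline, so a practical system would need a finer coding or a verification step.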

Received: 30 June 2008