Information extracted from semi-structured texts is typically coarse-grained, which makes subsequent semantic analysis ineffective. To address this, a structured information extraction method based on pattern matching is proposed. The method targets semi-structured texts presented on the web: it performs domain recognition on the coarse-grained extraction results and loads the lexicon suited to the recognized domain. Each role in a pattern is then mapped to the corresponding word in the word sequence according to the part of speech specified for that role. Structured information is thereby extracted, providing support for accurate semantic analysis. Experiments show that the proposed method achieves more accurate extraction results.
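The pipeline summarized above (domain recognition, lexicon loading, then mapping pattern roles to words by part of speech) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicons, the pattern, the tag names, and the functions `recognize_domain`, `match_pattern`, and `extract` are all hypothetical, and domain recognition is reduced here to a simple count of lexicon hits.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical domain lexicons (assumption): word -> part-of-speech tag.
LEXICONS: Dict[str, Dict[str, str]] = {
    "recruitment": {
        "Alice": "NAME", "joined": "VERB", "Acme": "ORG",
        "as": "PREP", "engineer": "TITLE",
    },
    "weather": {"Beijing": "PLACE", "is": "VERB", "sunny": "ADJ"},
}

# Hypothetical patterns (assumption): an ordered list of (role, POS) pairs.
PATTERNS: Dict[str, List[Tuple[str, str]]] = {
    "recruitment": [
        ("employee", "NAME"), ("action", "VERB"),
        ("employer", "ORG"), ("position", "TITLE"),
    ],
    "weather": [("location", "PLACE"), ("condition", "ADJ")],
}


def recognize_domain(words: List[str]) -> str:
    """Pick the domain whose lexicon covers the most words (a simplification)."""
    return max(LEXICONS, key=lambda d: sum(w in LEXICONS[d] for w in words))


def match_pattern(
    words: List[str],
    lexicon: Dict[str, str],
    pattern: List[Tuple[str, str]],
) -> Optional[Dict[str, str]]:
    """Map each role to the next word whose POS tag equals the role's POS.

    Words that do not fit the current role are skipped. Returns None when
    the word sequence ends before every role has been filled.
    """
    roles: Dict[str, str] = {}
    next_role = 0
    for word in words:
        if next_role == len(pattern):
            break
        role, pos = pattern[next_role]
        if lexicon.get(word) == pos:
            roles[role] = word
            next_role += 1
    return roles if next_role == len(pattern) else None


def extract(text: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Recognize the domain, load its lexicon, and fill the pattern's roles."""
    words = text.split()
    domain = recognize_domain(words)
    return domain, match_pattern(words, LEXICONS[domain], PATTERNS[domain])


if __name__ == "__main__":
    print(extract("Alice joined Acme as engineer"))
```

In this sketch the word "as" is skipped because its tag does not match the pending role, so the pattern still completes. A real system would need proper word segmentation and part-of-speech tagging in place of whitespace splitting and dictionary lookup.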