Web Information Extraction Based on Probabilistic Model
WANG Jing, LIU Zhi-Jing
School of Computer Science and Engineering, Xidian University, Xi'an 710071
Abstract A model named tree-structured hierarchical conditional random fields (TH-CRFs) is proposed that exploits both the structure and the content features of web pages. Firstly, a multi-feature vector space model is proposed to represent web pages in terms of their structure and content. Secondly, the Boolean model and multiple rules are introduced to encode these features so that web objects are represented more accurately. Thirdly, web object information extraction based on TH-CRFs is performed to extract recruitment knowledge, with the training procedure optimized for efficiency. Finally, the proposed model is compared with existing approaches to web object information extraction. The experimental results show that TH-CRFs significantly improve the accuracy of web object information extraction and reduce the time complexity.
Received: 17 August 2009