A Multi-record Webpage Attribute Extraction Method Combining Active Learning
WEI Jingjing1,2, LIAO Xiangwen3,4, CHEN Qiaoling3,4, MA Feixiang3,4, CHEN Guolong3,4
1. College of Physics and Information Engineering, Fuzhou University, Fuzhou 350116; 2. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108; 3. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116; 4. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116
Abstract: The attribute extraction process can be separated into two phases: alignment and annotation. Existing alignment methods mistakenly align attributes with different semantics into the same group. Furthermore, to improve the accuracy of semantic annotation, time-consuming manual annotation is often required to construct the training set. To solve these problems, a multi-record webpage attribute extraction method combining active learning is presented. To address wrong attribute alignment, shallow semantics are integrated into the alignment approach to reduce the influence of identical tags that carry different semantics. In the semantic annotation phase, textual, visual and global features are extracted for semantic classification, and an active-learning-based SVM classifier is applied to extract structured data. Moreover, a new sample selection strategy is proposed that introduces global sample information, so that more informative samples with lower confidence are selected for labeling. Experimental results on BBS and microblog datasets confirm the superiority of the proposed method.
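The idea of selecting lower-confidence samples for labeling can be illustrated with a minimal sketch of generic least-confidence sampling. This is not the selection strategy proposed in the paper, which additionally uses global sample information; it assumes only that the classifier provides per-class probability estimates for each unlabeled sample. The function name `least_confident` and the probability values are hypothetical.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k samples whose top class probability is lowest.

    Generic least-confidence sampling (an illustrative assumption, not the
    paper's strategy): a sample's confidence is the probability of its most
    likely class, and the least confident samples are queried first.
    """
    confidences = [max(p) for p in probabilities]
    # Sort sample indices by ascending confidence; ties keep their input order.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]


# Hypothetical class-probability estimates for four unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],
]

# Indices of the two samples to send for manual labeling.
selected = least_confident(probs, 2)
print(selected)
```

In an active learning loop, the selected samples would be labeled manually, added to the training set, and the classifier retrained before the next round of selection.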