A Multi-record Webpage Attribute Extraction Method Combining Active Learning<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201608001


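Abstract: Attribute extraction can be divided into two processes, alignment and semantic annotation. In existing alignment methods, some attributes that share the same tag but carry different semantics are wrongly assigned to the same group, and a large manually annotated training set is usually needed to improve the precision of semantic annotation. To address this, a multi-record webpage attribute extraction method combining active learning is proposed. For the attribute misalignment problem, the shallow semantics of attributes are introduced to reduce the effect of identical tags with inconsistent semantics. In the semantic annotation stage, based on textual, visual and global features of the webpage, an active-learning-based SVM classification method is used to obtain structured data with semantics. For the active learning selection strategy, overall sample information is introduced to build a strategy based on an uncertainty measure, which selects for annotation the samples whose semantic classification is predicted inaccurately. Experiments on multiple datasets, including forums and microblogs, show that the proposed method achieves better extraction results than existing methods.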

Authors: 魏晶晶 (WEI Jingjing), 廖祥文 (LIAO Xiangwen), 陈巧灵 (CHEN Qiaoling), 马飞翔 (MA Feixiang), 陈国龙 (CHEN Guolong)

Keywords: attribute extraction, semantic classification, active learning

Abstract: The attribute extraction process can be separated into two phases, alignment and annotation. In existing alignment methods, attributes with the same tag but different semantics are mistakenly aligned into the same group. Furthermore, to improve the accuracy of semantic annotation, time-consuming manual annotation is often needed to construct the training set. To solve these problems, a multi-record webpage attribute extraction method combining active learning is presented. To address wrong attribute alignment, shallow semantics are integrated into the alignment approach to reduce the influence of identical tags with different semantics. In the semantic annotation phase, textual, visual and global features are extracted for semantic classification, and an active-learning-based SVM classifier is applied to extract structured data. Moreover, a new sample selection strategy is proposed that introduces global sample information and selects the more informative, lower-confidence samples for labeling. The experimental results on BBS and microblog datasets confirm the superiority of the proposed method.

Key words: Attribute Extraction, Semantic Classification, Active Learning
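
The abstract states that the selection strategy picks samples whose semantic classification is predicted with low confidence. The snippet below is a minimal sketch of that general idea (plain uncertainty sampling), not the authors' strategy: it assumes class-probability estimates from an SVM classifier are already available, uses Shannon entropy as the uncertainty measure (an assumption), and does not model the overall sample information mentioned in the abstract. The names `entropy`, `select_uncertain`, `pool_probs` and the three example labels are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain(prob_rows: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical probability estimates for four unlabeled attribute nodes over
# three made-up semantic labels (for example: author, time, content).
pool_probs = [
    [0.98, 0.01, 0.01],  # classifier is confident
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
]

# Indices of the two samples that would be sent to a human annotator next.
to_label = select_uncertain(pool_probs, k=2)
print(to_label)
```

In an active learning loop, the selected samples would be labeled manually, added to the training set, and the classifier retrained before the next selection round.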

Received: 2015-02-02

CLC number (ZTFLH): TP 391

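Funding: Supported by the National Natural Science Foundation of China, Youth Fund (No. 61300105); the Joint Project of the Doctoral Program Foundation of the Ministry of Education (No. 2012351410010); the Fujian Provincial Major Science and Technology Special Project (No. 2013H6012); and the Fuzhou Science and Technology Program (No. 2013-PT-45, 2012-G-113).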

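About the authors: WEI Jingjing (魏晶晶), female, born in 1984, Ph.D. candidate. Her main research interest is intelligent information processing. E-mail: weijj517@163.com. LIAO Xiangwen (廖祥文, corresponding author), male, born in 1980, Ph.D., associate professor. His main research interest is text opinion retrieval and mining. E-mail: liaoxw@fzu.edu.cn. CHEN Qiaoling (陈巧灵), female, born in 1989, master's student. Her main research interest is Web data mining. E-mail: chenql.fz@gmail.com. MA Feixiang (马飞翔), male, born in 1991, master's student. His main research interest is sentiment analysis. E-mail: asoar907@gmail.com. CHEN Guolong (陈国龙), male, born in 1965, Ph.D., professor. His main research interest is intelligent information processing. E-mail: cgl@fzu.edu.cn.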

Cite this article:

魏晶晶，廖祥文，陈巧灵，马飞翔，陈国龙. 结合主动学习的多记录网页属性抽取方法[J]. 模式识别与人工智能, 2016, 29(8): 673-681. WEI Jingjing, LIAO Xiangwen, CHEN Qiaoling, MA Feixiang, CHEN Guolong. A Multi-record Webpage Attribute Extraction Method Combining Active Learning. Pattern Recognition and Artificial Intelligence, 2016, 29(8): 673-681.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201608001 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2016/V29/I8/673
