Multirecord Webpage Extraction Based on DOM Tree Hierarchical Feature<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201502004


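Abstract (translated from the Chinese): Existing multirecord webpage extraction methods usually analyze the overall vertical structure of the Document Object Model (DOM) tree. The structural similarity they compute is generally low, so they fail to identify record regions correctly. This paper proposes a record extraction method based on the hierarchical features of the DOM tree. The method analyzes the tree horizontally, exploiting the different roles that nodes play at different levels, and turns the problem of finding similar subtrees into the problem of finding similar sub-blocks within node blocks. Finally, a two-way expanding search for non-overlapping repeated sub-blocks is used to separate the records. Experiments show that the method can extract pages that existing extractors cannot handle, and extraction results on multiple data sources confirm its effectiveness.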


Keywords (translated from the Chinese): information extraction, multirecord webpage, extraction algorithm

Abstract: Existing multirecord webpage extraction methods usually analyze the document object model (DOM) tree as a whole along its longitudinal structure. The structural similarity computed this way is generally low, so record regions cannot be identified correctly. In contrast to previous work, a method named data record extraction based on DOM tree hierarchical feature (DEBHF) is proposed. It analyzes the DOM tree transversely by distinguishing the different roles of nodes at different levels, which converts the problem of searching for similar sub-trees into the problem of searching for similar sub-blocks in data blocks. Finally, a two-way search for non-overlapping repeated sub-blocks is adopted to segment the record regions. Experimental results show that the proposed approach can handle webpages that existing methods cannot process, and the extraction results on different data sources demonstrate its effectiveness.
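The final step described in the abstract, a two-way search for non-overlapping repeated sub-blocks, can be illustrated with a minimal sketch. This is not the authors' DEBHF implementation. It assumes that the children of one DOM node have already been flattened into a list of tag names, and it replaces the paper's similarity measure with exact equality of tag sequences. The function names `expand_both_ways` and `segment_records` are hypothetical.

```python
from typing import List, Tuple


def expand_both_ways(tags: List[str], start: int, size: int) -> Tuple[int, int]:
    """Grow a run of identical, non-overlapping sub-blocks to the left and
    right of the seed block tags[start:start + size].

    Returns the half-open range (left, right) covered by the run.
    Assumption: sub-blocks match only when their tag sequences are equal.
    """
    seed = tags[start:start + size]
    left = start
    while left - size >= 0 and tags[left - size:left] == seed:
        left -= size
    right = start + size
    while right + size <= len(tags) and tags[right:right + size] == seed:
        right += size
    return left, right


def segment_records(tags: List[str], min_repeats: int = 2) -> List[Tuple[int, int]]:
    """Return half-open index ranges of the repeated sub-blocks that cover
    the most sibling nodes; an empty list if nothing repeats."""
    best_covered, best_left, best_size = 0, 0, 0
    n = len(tags)
    for size in range(1, n // 2 + 1):          # candidate sub-block length
        for start in range(0, n - size + 1):   # candidate seed position
            left, right = expand_both_ways(tags, start, size)
            covered = right - left
            # Strict '>' keeps the smallest block size when coverage ties.
            if covered // size >= min_repeats and covered > best_covered:
                best_covered, best_left, best_size = covered, left, size
    if best_size == 0:
        return []
    return [(i, i + best_size)
            for i in range(best_left, best_left + best_covered, best_size)]


if __name__ == "__main__":
    # Hypothetical sibling tags under one container node.
    siblings = ["h1", "div", "p", "div", "p", "div", "p", "footer"]
    print(segment_records(siblings))  # [(1, 3), (3, 5), (5, 7)]
```

In this toy input the three `div`, `p` pairs are reported as separate records, while the leading `h1` and trailing `footer` are left out. A real page would need a tolerant similarity measure in place of the equality test, since records on actual pages rarely share identical tag sequences.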

Key words: Information Extraction, Multirecord Webpage, Extraction Algorithm

Received: 2013-11-29

Chinese Library Classification (ZTFLH): TP391

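Funding: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (No.61300105), the Joint Project of the Doctoral Program Fund of the Ministry of Education (No.2012351410010), the Fujian Provincial Major Science and Technology Project (No.2013H6012), and the Fuzhou Science and Technology Plan Project (No.2013-PT-45)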

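About the authors: CHEN Qiao-Ling (陈巧灵), female, born in 1989, master's student. Her main research interest is Web data mining. E-mail: chenql.fz@gmail.com. LIAO Xiang-Wen (廖祥文, corresponding author), male, born in 1980, Ph.D., associate professor. His main research interest is text orientation (sentiment) retrieval and mining. E-mail: liaoxw@fzu.edu.cn. WEI Jing-Jing (魏晶晶), female, born in 1984, Ph.D. candidate. Her main research interest is intelligent information processing. CHEN Guo-Long (陈国龙), male, born in 1965, professor and doctoral supervisor. His main research interest is intelligent information processing.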

Cite this article:

CHEN Qiao-Ling, LIAO Xiang-Wen, WEI Jing-Jing, CHEN Guo-Long. Multirecord Webpage Extraction Based on DOM Tree Hierarchical Feature[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2015, 28(2): 125-131.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201502004 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2015/V28/I2/125
