Pattern Recognition and Artificial Intelligence
  2015, Vol. 28 Issue (2): 125-131    DOI: 10.16451/j.cnki.issn1003-6059.201502004
Papers and Reports
Multirecord Webpage Extraction Based on DOM Tree Hierarchical Feature
CHEN Qiao-Ling, LIAO Xiang-Wen, WEI Jing-Jing, CHEN Guo-Long
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108

Abstract  Existing multirecord webpage extraction methods usually analyze the document object model (DOM) tree longitudinally as a whole. The structural similarity they compute is consistently low, so record regions cannot be identified correctly. In contrast to previous work, a method named data record extraction based on DOM tree hierarchical feature (DEBHF) is proposed, which analyzes the DOM tree transversely by distinguishing the different roles that nodes play at different levels. The problem of searching for similar sub-trees is thereby converted into the problem of searching for similar sub-blocks within data blocks. Finally, a two-way search for non-overlapping repeated sub-blocks is used to segment the record regions. Experimental results show that the proposed approach can handle webpages that existing methods cannot process, and extraction results on different data sources demonstrate its effectiveness.
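To make the idea of searching for repeated sub-blocks concrete, the following is a minimal sketch and not the authors' DEBHF algorithm, whose details are not given in the abstract. It assumes that the children of one candidate data region have already been reduced to a flat sequence of tag names, and it uses a simple one-direction brute-force scan in place of the two-way search. The function names child_tags and find_repeated_blocks and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def child_tags(markup: str) -> List[str]:
    """Return the tag names of the direct children of the root element.

    Assumes well-formed (XHTML-like) markup; real webpages would need a
    tolerant HTML parser instead.
    """
    root = ET.fromstring(markup)
    return [child.tag for child in root]


def find_repeated_blocks(
    tags: List[str], min_repeats: int = 2
) -> Optional[Tuple[int, int, int]]:
    """Find the longest run of adjacent, non-overlapping identical sub-blocks.

    Returns (start, block_len, repeats), or None if no sub-block repeats
    at least `min_repeats` times. Ties are resolved in favor of the
    earliest start and then the shortest block.
    """
    n = len(tags)
    best = None  # (covered, start, block_len, repeats)
    for start in range(n):
        max_len = (n - start) // min_repeats
        for block_len in range(1, max_len + 1):
            block = tags[start:start + block_len]
            repeats = 1
            pos = start + block_len
            # Step forward one whole block at a time so repeats never overlap.
            while tags[pos:pos + block_len] == block:
                repeats += 1
                pos += block_len
            covered = repeats * block_len
            if repeats >= min_repeats and (best is None or covered > best[0]):
                best = (covered, start, block_len, repeats)
    return None if best is None else best[1:]


# Hypothetical data region: a heading, three (div, p) records, and a footer.
sample = "<body><h1/><div/><p/><div/><p/><div/><p/><footer/></body>"
tags = child_tags(sample)
result = find_repeated_blocks(tags)
```

For the sample above, the scan reports a sub-block of length 2 starting at index 1 and repeated 3 times, which corresponds to three candidate records. Comparing tag names for exact equality is a deliberate simplification; a similarity measure over sub-blocks would be needed for records whose structure varies.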
Key words: Information Extraction, Multirecord Webpage, Extraction Algorithm
Received: 29 November 2013     
ZTFLH: TP391  
Cite this article:   
CHEN Qiao-Ling, LIAO Xiang-Wen, WEI Jing-Jing, et al. Multirecord Webpage Extraction Based on DOM Tree Hierarchical Feature[J]. Pattern Recognition and Artificial Intelligence, 2015, 28(2): 125-131.
URL:  
http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.201502004      OR     http://manu46.magtech.com.cn/Jweb_prai/EN/Y2015/V28/I2/125
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No.350 Shushanhu Road, Hefei, Anhui Province, P.R. China Tel: 0551-65591176 Fax:0551-65591176 Email: bjb@iim.ac.cn
Supported by Beijing Magtech  Email:support@magtech.com.cn