Pattern Recognition and Artificial Intelligence (模式识别与人工智能)
2013, Vol. 26, Issue 7: 667-672
Original Article
Web Content Extraction Based on Text Density Model
ZHU Ze-De¹,², LI Miao², ZHANG Jian², CHEN Lei², ZENG Xin-Hua²
1. Department of Automation, University of Science and Technology of China, Hefei 230026
2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031

Abstract  To obtain useful content that is buried in a large amount of irrelevant information, content extraction is indispensable for web data applications. An approach to web content extraction based on a text density model is proposed, which integrates page structure features with language features to convert the text lines of a page document into a sequence of positive and negative density values. The density sequence is then revised with Gaussian smoothing, which takes the content continuity of adjacent lines into account. Finally, an improved maximum subsequence segmentation is adopted to split the sequence and extract the web content. Without human intervention or repeated training, the approach maintains the integrity of the content and eliminates noise. The experimental results indicate that the approach adapts well to different data sources, and that both its accuracy and its recall are better than those of existing statistical models.
Key words: Web Mining; Content Extraction; Text Density; Gaussian Smoothing; Maximum Subsequence
Received: 30 August 2012     
ZTFLH: TP391  
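
The pipeline outlined in the abstract (a signed per-line density sequence, Gaussian smoothing, then maximum subsequence segmentation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the density formula or the details of the improved segmentation, so `line_density`, its `threshold` parameter, and the kernel settings `sigma` and `radius` are all assumptions made for illustration, and plain Kadane's algorithm stands in for the improved maximum subsequence step.

```python
import math
import re

TAG_RE = re.compile(r"<[^>]*>")


def line_density(line, threshold=0.5):
    """Hypothetical score (an assumption, not the paper's formula):
    share of non-tag characters in the line minus a threshold, so
    text-heavy lines score positive and markup-heavy lines negative."""
    stripped = line.strip()
    if not stripped:
        return -threshold
    text_len = len(TAG_RE.sub("", stripped).strip())
    return text_len / len(stripped) - threshold


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a truncated Gaussian kernel, renormalized
    at the edges so that border lines are not pulled toward zero."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):
                total += w * values[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed


def max_subsequence(values):
    """Plain Kadane's algorithm: return (start, end), end exclusive,
    of the contiguous run with the largest sum."""
    best_sum = float("-inf")
    best = (0, 0)
    cur_sum = 0.0
    cur_start = 0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart here.
            cur_sum = v
            cur_start = i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum = cur_sum
            best = (cur_start, i + 1)
    return best


def extract_content(lines):
    """Score each line, smooth the scores, keep the best contiguous run."""
    scores = gaussian_smooth([line_density(line) for line in lines])
    start, end = max_subsequence(scores)
    return lines[start:end]


if __name__ == "__main__":
    page = [
        '<div class="nav"><a href="/">Home</a></div>',
        "<p>First paragraph of the article body text goes here and it is long.</p>",
        "<p>Second paragraph continues the story with more plain text content.</p>",
        '<div class="footer"><a href="/about">About</a></div>',
    ]
    for line in extract_content(page):
        print(line)
```

In this toy page the two paragraph lines score positive and the navigation and footer lines score negative, so the maximum-sum run covers only the paragraphs. Smoothing matters on real pages, where a short line inside the article body (a one-word subheading, for instance) can score negative on its own but is lifted by its text-heavy neighbors.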
Cite this article:
ZHU Ze-De, LI Miao, ZHANG Jian, et al. Web Content Extraction Based on Text Density Model[J]. Pattern Recognition and Artificial Intelligence, 2013, 26(7): 667-672.
URL:  
http://manu46.magtech.com.cn/Jweb_prai/EN/      OR     http://manu46.magtech.com.cn/Jweb_prai/EN/Y2013/V26/I7/667
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No. 350 Shushanhu Road, Hefei, Anhui Province, P.R. China; Tel: 0551-65591176; Fax: 0551-65591176; Email: bjb@iim.ac.cn