1. Department of Automation, University of Science and Technology of China, Hefei 230026
2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031
Abstract: Content extraction is indispensable for web data applications, because the useful content of a page is surrounded by a large amount of irrelevant information. An approach to web content extraction based on a text density model is proposed. It integrates page structure features with language features to convert the text lines of a page document into a sequence of positive and negative density values. Gaussian smoothing is then applied to revise the density sequence, so that the content continuity of adjacent lines is taken into account. Finally, an improved maximum sequence segmentation is adopted to split the sequence and extract the web content. Without human intervention or repeated training, the approach maintains the integrity of the content and eliminates noise. The experimental results indicate that the approach adapts well to different data sources, and that both its accuracy and its recall are better than those of existing statistical models.
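The last two stages of the pipeline described in the abstract (smoothing a per-line density sequence, then cutting out the best-scoring run of lines) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the per-line scores are supplied by the caller, since the abstract does not define the density formula; the names `smooth`, `max_subsequence` and `extract_content` and the parameters `radius` and `sigma` are hypothetical; and the plain maximum-sum contiguous subsequence (Kadane's algorithm) stands in for the "improved" segmentation, whose details the abstract does not give.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, radius=1, sigma=1.0):
    """Gaussian-smooth a score sequence; edges reuse the nearest value."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(scores)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the edges
            acc += w * scores[j]
        out.append(acc)
    return out


def max_subsequence(scores):
    """Half-open (start, end) of the maximum-sum contiguous run.

    Returns (0, 0) when no run has a positive sum.
    """
    best_sum, best = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = s, i  # start a new run at this line
        else:
            cur_sum += s               # extend the current run
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_content(lines, scores, radius=1, sigma=1.0):
    """Keep the lines covered by the best run of the smoothed scores."""
    start, end = max_subsequence(smooth(scores, radius, sigma))
    return lines[start:end]


if __name__ == "__main__":
    # Hypothetical scores: positive = content-like, negative = noise-like.
    lines = ["nav", "menu", "title", "para 1", "ad", "para 2", "footer", "links"]
    scores = [-1, -1, 2, 3, -0.5, 4, -2, -1]
    print(extract_content(lines, scores))
```

In this toy input, the slightly negative line between two dense lines is pulled above zero by its neighbours once the sequence is smoothed, which is the effect of "content continuity" that the abstract attributes to the smoothing step.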
ZHU Ze-De, LI Miao, ZHANG Jian, CHEN Lei, ZENG Xin-Hua. Web Content Extraction Based on Text Density Model [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2013, 26(7): 667-672.