A URL Filtering Generation Algorithm Based on Similarity Degree for Web Crawling
CHEN Hui-Hui1, SHU Yun-Xing1, LIN Li2
1 Department of Computer and Information Engineering, Luoyang Institute of Science and Technology, Luoyang 471023
2 Department of Asian African Languages, PLA University of Foreign Languages, Luoyang 471003
Abstract: Web text is an important component of a corpus, but the time spent visiting redundant URLs lowers the quality and efficiency of large-scale Web crawling. Effective URL filtering rules can improve both. Because files are distributed unevenly across the virtual directories of a website, a method for generating URL filtering rules is introduced to discover the regions where target files cluster. First, URLs are transformed into regular expressions and divided into groups, each containing the URLs that share the same regular expression. Then, the similarity degrees between the URLs in each group are calculated, and a virtual path tree is constructed from the URLs with higher similarity degrees. Finally, the virtual path tree is used to generate URL filtering rules and classification rules for Web crawling. The algorithms for generating the virtual path tree are described in detail, and the virtual path trees and filtered URLs obtained under different similarity degree thresholds are compared experimentally.
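The three steps summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how URLs are generalized into regular expressions or how the similarity degree is defined, so both are assumptions here: digit runs and letter runs are replaced by character classes, and similarity is taken as the share of common leading directories. All function names, the example URLs, and the threshold value are hypothetical and are not the authors' implementation.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed tokenization: digit runs, letter runs, or any single other character.
TOKEN = re.compile(r"(?P<num>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<other>.)", re.S)


def url_to_pattern(url):
    """Generalize the path of a URL into a regular expression (assumed scheme)."""
    parts = []
    for m in TOKEN.finditer(urlparse(url).path):
        if m.lastgroup == "num":
            parts.append("[0-9]+")
        elif m.lastgroup == "word":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(m.group()))
    return "".join(parts)


def group_by_pattern(urls):
    """Step 1: cluster URLs that generalize to the same regular expression."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_to_pattern(url)].append(url)
    return dict(groups)


def directories(url):
    """Virtual directories of a URL, without the final file name."""
    return [s for s in urlparse(url).path.split("/") if s][:-1]


def similarity(a, b):
    """Assumed similarity degree: shared leading directories / longer depth."""
    da, db = directories(a), directories(b)
    longest = max(len(da), len(db))
    if longest == 0:
        return 1.0
    shared = 0
    for x, y in zip(da, db):
        if x != y:
            break
        shared += 1
    return shared / longest


def build_path_tree(groups, threshold):
    """Step 2: keep URLs similar to a group peer and merge their directories."""
    tree = {}
    for urls in groups.values():
        for u in urls:
            if any(similarity(u, v) >= threshold for v in urls if v != u):
                node = tree
                for seg in directories(u):
                    node = node.setdefault(seg, {})
    return tree


def tree_to_rules(tree, prefix=""):
    """Step 3: turn each leaf directory into a path-prefix filtering rule."""
    rules = []
    for seg, child in sorted(tree.items()):
        path = prefix + "/" + seg
        if child:
            rules.extend(tree_to_rules(child, path))
        else:
            rules.append("^" + re.escape(path) + "/")
    return rules


# Hypothetical example URLs and threshold.
urls = [
    "http://example.com/news/2014/a123.html",
    "http://example.com/news/2014/b456.html",
    "http://example.com/about/index.html",
]
groups = group_by_pattern(urls)
tree = build_path_tree(groups, threshold=0.8)
rules = tree_to_rules(tree)
print(rules)
```

In this sketch a URL whose group contains no sufficiently similar peer contributes nothing to the tree, so a lower threshold yields a larger tree and more permissive rules, which is the kind of trade-off the abstract says is compared experimentally.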
陈荟慧,舒云星,林丽. Web语料抓取中基于相似度的URL过滤规则生成算法*[J]. 模式识别与人工智能, 2014, 27(7): 631-637.
CHEN Hui-Hui, SHU Yun-Xing, LIN Li. A URL Filtering Generation Algorithm Based on Similarity Degree for Web Crawling. Pattern Recognition and Artificial Intelligence, 2014, 27(7): 631-637.