Pattern Recognition and Artificial Intelligence (模式识别与人工智能)
2014, Vol. 27, Issue 7: 631-637
Researches and Applications
A URL Filtering Generation Algorithm Based on Similarity Degree for Web Crawling
CHEN Hui-Hui¹, SHU Yun-Xing¹, LIN Li²
¹ Department of Computer and Information Engineering, Luoyang Institute of Science and Technology, Luoyang 471023
² Department of Asian African Languages, PLA University of Foreign Languages, Luoyang 471003

Abstract  Web text is an important component of corpora. However, time spent visiting redundant URLs degrades the quality and efficiency of large-scale web crawling, both of which can be improved by highly effective URL filtering rules. Because files are unevenly distributed across the virtual directories of a website, a URL filtering rule generation method is introduced to discover the regions where target files cluster. First, URLs are transformed into regular expressions and divided into groups by clustering identical regular expressions. Then, the similarity degrees between the URLs within each group are calculated, and a virtual path tree is constructed from the URLs with higher similarity degrees. Finally, the virtual path tree is used to generate URL filtering rules and classification rules for web crawling. The algorithms for generating the virtual path tree are described in detail, and the virtual path trees and filtered URLs obtained under different similarity degree thresholds are compared experimentally.
Key words: URL Similarity Degree; Web Text Crawling; URL Filtering; Text Classification
Received: 20 May 2013     
ZTFLH: TP391.1  
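
The three steps named in the abstract (generalize URLs into regular expressions and group them, compare URLs within a group, build a virtual path tree and derive filtering rules) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual definitions, so everything concrete here is an assumption made for illustration: replacing digit runs with `\d+` as the generalization scheme, the shared-leading-directory ratio as the similarity degree, the nested-dict tree, and the `example.com` URLs. It is not the authors' implementation.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_to_pattern(url):
    """Generalize a URL path into a regex by replacing digit runs with \\d+.

    Assumed scheme; the paper's transformation may differ.
    """
    parts = re.split(r"(\d+)", urlparse(url).path)  # odd indexes hold digit runs
    return "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(parts))


def directories(url):
    """Return the virtual directory segments of a URL, without the file name."""
    return [seg for seg in urlparse(url).path.split("/")[:-1] if seg]


def similarity(a, b):
    """Assumed similarity degree: shared leading directories / longest depth."""
    da, db = directories(a), directories(b)
    common = 0
    for x, y in zip(da, db):
        if x != y:
            break
        common += 1
    longest = max(len(da), len(db))
    return common / longest if longest else 1.0


def build_path_tree(urls, threshold):
    """Group URLs by pattern, then add the directories of every URL that has
    at least one sufficiently similar neighbour in its group to a nested dict."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_to_pattern(url)].append(url)
    tree = {}
    for members in groups.values():
        for url in members:
            if any(v != url and similarity(url, v) >= threshold for v in members):
                node = tree
                for seg in directories(url):
                    node = node.setdefault(seg, {})
    return tree


def leaf_paths(tree, prefix=""):
    """Yield each root-to-leaf directory path of the tree, e.g. '/news/2014/'."""
    for seg, child in sorted(tree.items()):
        path = prefix + "/" + seg
        if child:
            yield from leaf_paths(child, path)
        else:
            yield path + "/"


def filtering_rules(tree):
    """Turn every leaf directory into a compiled path-prefix rule."""
    return [re.compile(re.escape(p)) for p in leaf_paths(tree)]


def accepts(rules, url):
    """A URL passes the filter if its path starts with any rule prefix."""
    path = urlparse(url).path
    return any(rule.match(path) for rule in rules)


if __name__ == "__main__":
    sample = [  # hypothetical URLs
        "http://example.com/news/2014/07/631.html",
        "http://example.com/news/2014/07/632.html",
        "http://example.com/news/2013/01/5.html",
        "http://example.com/about/contact.html",
    ]
    for threshold in (0.8, 0.3):
        rules = filtering_rules(build_path_tree(sample, threshold))
        print(threshold, [r.pattern for r in rules])
```

In this sketch a lower threshold admits more directories into the tree and therefore yields more permissive rules, which is the kind of trade-off the abstract says the experiments compare across thresholds.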
Cite this article:   
CHEN Hui-Hui, SHU Yun-Xing, LIN Li. A URL Filtering Generation Algorithm Based on Similarity Degree for Web Crawling[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(7): 631-637.
URL: http://manu46.magtech.com.cn/Jweb_prai/EN/ OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2014/V27/I7/631
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No.350 Shushanhu Road, Hefei, Anhui Province, P.R. China Tel: 0551-65591176 Fax:0551-65591176 Email: bjb@iim.ac.cn