Distributed WEB Information Retrieval Based on Link Partition
ZHANG Gang1,2, WANG Bin1, WU LiHui1
1. Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; 2. Graduate School of Chinese Academy of Sciences, Beijing 100039
Abstract: Distributed information retrieval is an effective approach to large-scale WEB information retrieval. A link-based clustering algorithm (LIBCA) is proposed for document partitioning, and a Bloom filter is used to improve the efficiency of LIBCA. The CORI collection selection algorithm and Okapi BM25 are used in the distributed retrieval process. Using the TREC WEB datasets from the most recent three years, link-based distributed information retrieval is compared with centralized retrieval and with random-partition-based distributed information retrieval. The experiments indicate that, at P@10, the results of link-partition-based distributed WEB information retrieval are equal to or even better than those of centralized retrieval. The efficiency experiments indicate that LIBCA combined with the Bloom filter achieves high system performance and can handle large datasets.
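The abstract does not say how LIBCA uses the Bloom filter, so the following is only a minimal sketch of the data structure itself, assuming it serves as a compact membership test over hyperlinks. The class `BloomFilter`, the helper `link_key`, the URLs, and the sizing parameters are all hypothetical and are not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def link_key(src: str, dst: str) -> str:
    """Hypothetical encoding of a directed link as a single string key."""
    return f"{src} -> {dst}"


# Assumed usage: remember which links have been seen without storing them.
seen_links = BloomFilter(num_bits=1 << 20, num_hashes=7)
seen_links.add(link_key("http://a.example/", "http://b.example/"))
print(link_key("http://a.example/", "http://b.example/") in seen_links)
```

A Bloom filter trades a small false-positive rate for a large memory saving, which is one plausible reason it would help a link-based algorithm scale to a large collection.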
ZHANG Gang, WANG Bin, WU LiHui. Distributed WEB Information Retrieval Based on Link Partition [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2007, 20(4): 519-524.
[1] Callan J P, Lu Zhihong, Croft W B. Searching Distributed Collections with Inference Networks // Proc of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, USA, 1995: 21-28
[2] French J C, Powell A L, Viles C I, et al. Evaluating Database Selection Techniques: A Testbed and Experiment // Proc of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia, 1998: 121-129
[3] Xu Jinxi, Croft W B. Cluster-Based Language Models for Distributed Retrieval // Proc of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley, USA, 1999: 254-261
[4] Small H. Co-Citation in the Scientific Literature: A New Measure of the Relationship between Two Documents. Journal of the American Society for Information Science, 1973, 24(4): 265-269
[5] Kessler M M. Bibliographic Coupling between Scientific Papers. American Documentation, 1963, 14(1): 10-25
[6] Amsler R. Application of Citation-Based Automatic Classification. Technical Report. Austin, USA: The University of Texas at Austin. Linguistics Research Center, 1972
[7] Callan J. Distributed Information Retrieval // Croft W B, ed. Advances in Information Retrieval. Dordrecht, Netherlands: Kluwer Academic Publishers, 2001: 127-150
[8] Robertson S E, Walker S, Jones S. Okapi at TREC-3 // Proc of the 3rd Text Retrieval Conference. Washington, USA, 1994: 109-126
[9] Bloom B. Space/Time Trade-Offs in Hash Coding with Allowable Errors. Communications of the ACM, 1970, 13(7): 422-426