|
|
Data Extraction from Limited Deep Web Based on Latticial Space |
ZHANG Zhuo1, LI Shi-Jun1, ZHANG Nai-Zhou1,2, TIAN Jian-Wei1 |
1.School of Computer, Wuhan University, Wuhan 430079 2.Department of Computer Science, Zhixing College of HuBei University, Wuhan 430072 |
|
|
Abstract In the situation of crawling Deep Web database that limits the number of results, the problem of appropriately predicting the results size of queries can be modeled as a set covering problem with condition of limited set size. This problem is modeled as a concept covering problem. Firstly, the relation among all couples composed by a query and its result is proved as tolerance. Secondly, set of them is proved as a complete lattice which is homomorphism to the concept lattice from the same source. Therefore, the order relation between concepts can be utilized to describe correlation between queries. The intent of a concept can be considered as a query, thus the result size is forecasted by cardinality of the concept extent. A lattice-based algorithm is proposed for data extraction from limited Deep Web database, called Ladeldew. Semi-lattice pruned based on the cardinality of extent is exploited by Ladeldew as search space. The new search space is iteratively generated from new data until nothing can be extracted. Both controlled and real experiments are implemented to evaluate Ladeldew, and the results verify its theoretical correction and realistic application.
|
Received: 16 December 2009
|
|
|
|
|
[1] Bergman M K. The Deep Web: Surfacing Hidden Value (White Paper). Journal of Electronic Publishing, 2001, 7(1): 1-17 [2] Kautz H A, Selman B, Shah M A. The Hidden Web. AI Magazine, 1997, 18(2): 27-36 [3] Madhavan J, Afanasiev L, Antova L, et al. Harnessing the Deep Web: Present and Future [EB/OL]. [2009-01-07]. www-db.cs.wise.edu/cidr/cidr2009/Paper_115.pdf [4] The DBLP Computer Science Bibliography [DB/OL]. [2010-06-01]. http://www.informatik.uni-trier.de/~ley/db/index.html [5] Madhavan J, Ko D, Kot L, et al. Googles Deep Web Crawl. Proc of the VLDB Endowment, 2008, 1(2): 1241-1252 [6] Chang K C C, He B, Zhang Zhen. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web // Proc of the 2nd Biennial Conference on Innovative Data Systems Research. Asilomar, USA, 2005: 44-55 [7] Liu Wei, Meng Xiaofeng, Meng Weiyi. A Survey of Deep Web Data Integration. Chinese Journal of Computers, 2007, 30(9): 1475-1489 (in Chinese) (刘 伟,孟小峰,孟卫一.Deep Web 数据集成研究综述.计算机学报, 2007, 30(9): 1475-1489) [8] Wu Ping, Wen Jirong, Liu Huan, et al. Query Selection Techniques for Efficient Crawling of Structured Web Sources //Proc of the 22nd International Conference on Data Engineering. Atlanta, USA, 2006: 47-56 [9] Vieira K, Barbosa L, Freire J, et al. Siphon++: A Hidden-Web Crawler for Keyword-Based Interfaces // Proc of the 17th ACM Conference on Information and Knowledge Management. Napa Valley, USA, 2008: 1361-1362 [10] Liu Wei, Meng Xiaofeng, Ling Yanyan. A Graph-Based Approach for Web Database Sampling. Journal of Software, 2008, 19(2): 179-193 (in Chinese) (刘 伟,孟小峰,凌妍妍.一种基于图模型的Web 数据库采样方法.软件学报, 2008, 19(2): 179-193) [11] Wang Yan, Lu Jianguo, Chen J. Crawling Deep Web Using a New Set Covering Algorithm // Proc of the 5th International Conference on Advanced Data Mining and Applications. Beijing, China, 2009: 326-337 [12] Ganter B, Wille R. Formal Concept Analysis: Mathematical Foundations. Berlin, Germany: Springer-Verlag, 1999 [13] Wang Liming, Zhang Zhuo. Algorithm for Closed Frequent Itemsets Mining Based on Apposition Assembly of Iceberg Concept Lattices. Journal of Computer Research and Development, 2007, 44(7): 1184-1190 (in Chinese) (王黎明,张 卓.基于iceberg概念格并置集成的闭频繁项集挖掘算法.计算机研究与发展, 2007, 44(7): 1184-1190) [14] Cimiano P, Hotho A, Staab S. Learning Concept Hierarchies from Text Corpora Using Formal Concept Analysis. Journal of Artificial Intelligence Research, 2005, 24(1): 305-339 [15] Valtchev P, Missaoui R, Lebrun P. A Partion-Based Approach towards Constructing Galois (Concept) Lattices. Discrete Mathematics, 2002, 256(3): 801-829 [16] Godin R, Missaoui R, Alaoui H. Incremental Concept Formation Algorithms Based on Galois (Concept) Lattices. Computational Intelligence, 1995, 11(2): 246-267 [17] Kuznetsov S, Obiedkov S. Comparing Performance of Algorithms for Generating Concept Lattices. Journal of Experimental and Theoretical Artificial Intelligence, 2002, 14(2/3): 189-216 |
|
|
|