Abstract: Crawling all deep web data is difficult for third-party applications because of the dynamicity, autonomy, and sheer quantity of deep web data sources. To tackle the deep web crawling problem under a query type restriction (only top-k queries are allowed) and limited query resources, an approach for incremental web crawling with a top-k query constraint is proposed. Historical data and domain knowledge are combined to maximize the total data quality of the repository. First, valid queries are generated using a query tree, and the changes captured by each query and its corresponding cost are estimated from historical data and domain knowledge. Next, based on the estimated query cost and data quality, an optimal subset of queries is selected approximately to globally maximize total data quality under limited query resources. Experimental results on real datasets show that the proposed approach improves the efficiency of crawling dynamic web databases.
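The selection step described in the abstract, choosing a subset of queries that maximizes estimated data quality under a limited query budget, can be illustrated with a minimal sketch. The abstract does not specify the selection algorithm, so the greedy gain-per-cost heuristic below is a stand-in, not the proposed method. The names `CandidateQuery`, `est_cost`, `est_gain`, and `select_queries` are hypothetical, and the cost and gain values are assumed to come from the estimation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    """Hypothetical record for one valid query produced by the query tree."""
    name: str
    est_cost: float  # assumed: estimated query resources the query consumes
    est_gain: float  # assumed: estimated data-quality improvement if issued


def select_queries(candidates, budget):
    """Greedy heuristic: pick queries by estimated gain per unit cost.

    This is an illustrative stand-in for the approximate selection step,
    not the algorithm from the paper.
    """
    # Rank by gain/cost ratio, best first; skip queries with non-positive cost.
    ranked = sorted(
        (c for c in candidates if c.est_cost > 0),
        key=lambda c: c.est_gain / c.est_cost,
        reverse=True,
    )
    chosen = []
    remaining = budget
    for c in ranked:
        # Take the query only if it still fits in the remaining budget.
        if c.est_cost <= remaining:
            chosen.append(c)
            remaining -= c.est_cost
    return chosen


if __name__ == "__main__":
    pool = [
        CandidateQuery("A", est_cost=4, est_gain=8),
        CandidateQuery("B", est_cost=3, est_gain=3),
        CandidateQuery("C", est_cost=2, est_gain=5),
        CandidateQuery("D", est_cost=5, est_gain=4),
    ]
    for q in select_queries(pool, budget=7):
        print(q.name, q.est_cost, q.est_gain)
```

A ratio-based greedy pass is cheap and easy to reason about, but on its own it can miss a single expensive, high-gain query; a practical variant would compare its result against the best single query that fits the budget.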