Incremental Deep Web Crawling with Top-<i>k</i> Query Constraint<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201701005


Abstract: Crawling all deep web data is difficult for third-party applications because deep web data sources are dynamic, autonomous, and large. To address deep web crawling when only top-k queries are allowed and query resources are limited, an approach for incremental deep web crawling with a top-k query constraint is proposed. Historical data and domain knowledge are combined to maximize the total data quality of the repository. First, valid queries are generated using a query tree, and the data changes each query would capture, together with the corresponding query cost, are estimated from historical data and domain knowledge. Next, based on the estimated query cost and data quality, an approximately optimal subset of queries is selected to maximize total data quality under the limited query resources. Experimental results on real datasets show that the proposed approach improves the efficiency of crawling dynamic web databases.
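The selection step described in the abstract can be illustrated with a small sketch. The abstract does not state which approximation algorithm the authors use, so the code below is only a generic greedy heuristic for budgeted selection (rank by estimated gain per unit cost), not the paper's method. The names `Query`, `cost`, `gain`, and `select_queries` are hypothetical, and the numbers are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical candidate query with estimated cost and quality gain."""
    name: str
    cost: int    # assumed: estimated number of top-k requests needed
    gain: float  # assumed: estimated improvement in repository data quality


def select_queries(candidates, budget):
    """Greedy budgeted selection: take queries in order of gain per unit cost.

    This is a common heuristic for knapsack-style problems, used here only
    as an illustration; it is not taken from the paper.
    """
    # Ignore queries with non-positive cost to avoid division by zero.
    usable = [q for q in candidates if q.cost > 0]
    ranked = sorted(usable, key=lambda q: q.gain / q.cost, reverse=True)

    chosen, spent = [], 0
    for q in ranked:
        # Skip any query that would exceed the remaining budget.
        if spent + q.cost <= budget:
            chosen.append(q)
            spent += q.cost
    return chosen, spent


# Made-up estimates for four candidate queries and a budget of 6 requests.
candidates = [
    Query("A", cost=2, gain=6.0),
    Query("B", cost=3, gain=6.0),
    Query("C", cost=4, gain=4.0),
    Query("D", cost=1, gain=0.5),
]
chosen, spent = select_queries(candidates, budget=6)
print([q.name for q in chosen], spent)
```

In this toy run, query C is skipped because it would exceed the budget, and the cheaper query D fills the remaining capacity. A greedy ratio rule of this kind is simple but not guaranteed to be optimal in general.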

Key words: Top-k Query, Web Database Crawling, Data Quality, Query Cost, Query Selection

Received: 10 September 2016

ZTFLH: TP 311

About the authors: JIANG Junyan, born in 1987, Ph.D. candidate. His research interests include Web data management. PENG Zhiyong, born in 1963, Ph.D., professor. His research interests include complex data management, trusted data management and Web data management. WU Xiaoying (corresponding author), born in 1973, Ph.D., associate professor. Her research interests include data management, query processing and optimization, keyword query, pattern mining, semantic web, and data integration.


Cite this article:

JIANG Junyan, PENG Zhiyong, WU Xiaoying. Incremental Deep Web Crawling with Top-k Query Constraint[J]. , 2017, 30(1): 43-53.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.201701005 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2017/V30/I1/43
