Incremental Deep Web Crawling with Top-<i>k</i> Query Constraint<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201701005


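Abstract: The dynamicity, autonomy, and volume of deep web data sources make it difficult for third-party applications to crawl all of their data. This paper studies the crawling of deep web data sources under a query-type constraint (only top-k queries are allowed) and a query-resource constraint, and proposes an incremental deep web crawling method based on the top-k query constraint that combines historical data and domain knowledge to optimize overall data quality. First, valid queries are obtained from a query tree, and query changes and query costs are estimated from historical data and domain knowledge. Then, based on the estimated query costs and data quality, a near-optimal subset of queries is selected to maximize overall data quality. Experiments show that the method improves both the efficiency and the data quality of crawling dynamic web databases.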
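The abstract does not spell out how the query tree is built, so the following is only a minimal sketch under stated assumptions: the root is the empty query, each level adds one `attribute = value` predicate, and a query counts as "valid" when the top-k interface returns all of its matches (no overflow). The names `top_k`, `valid_queries`, `db`, and `domains` are hypothetical and the in-memory list stands in for a remote web database.

```python
def top_k(db, query, k):
    """Simulated top-k interface (assumption): at most k matching rows plus an overflow flag."""
    matches = [row for row in db if all(row[a] == v for a, v in query.items())]
    return matches[:k], len(matches) > k


def valid_queries(db, domains, k):
    """Walk the query tree depth-first and collect non-empty queries that do not overflow."""
    attrs = list(domains)
    valid = []
    stack = [{}]  # root of the tree: the empty query
    while stack:
        query = stack.pop()
        rows, overflow = top_k(db, query, k)
        if not overflow:
            if rows:  # skip queries that match nothing
                valid.append(query)
            continue
        if len(query) == len(attrs):
            # Fully specified but still overflowing: this sketch cannot refine it further.
            continue
        attr = attrs[len(query)]  # refine on the next attribute
        for value in domains[attr]:
            stack.append({**query, attr: value})
    return valid


# Toy data, invented for illustration only.
db = [
    {"make": "a", "color": "red"},
    {"make": "a", "color": "red"},
    {"make": "a", "color": "blue"},
    {"make": "b", "color": "red"},
]
domains = {"make": ["a", "b"], "color": ["red", "blue"]}
valid = valid_queries(db, domains, k=2)
print(valid)
```

In this toy run the root query and `make = a` overflow and are refined, while the leaves that fit within k results together return every row of the database.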

Keywords: Top-k Query, Web Database Crawling, Data Quality, Query Cost, Query Selection

Abstract: Crawling all deep web data is difficult for third-party applications because deep web data sources are dynamic, autonomous, and large. To address deep web crawling under a query-type restriction (only top-k queries are allowed) and limited query resources, an approach for incremental deep web crawling with a top-k query constraint is proposed. Historical data and domain knowledge are combined to maximize the total data quality of the repository. First, valid queries are generated using a query tree, and the changes and cost of each query are estimated from historical data and domain knowledge. Next, based on the estimated query cost and data quality, a near-optimal subset of queries is selected to maximize total data quality under limited query resources. Experimental results on real datasets show that the proposed approach improves the efficiency of crawling dynamic web databases.
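The selection step can be pictured as a budgeted selection problem. The sketch below is not the paper's algorithm, which the abstract does not detail; it assumes that each candidate query has an additive estimated quality gain and an estimated cost in top-k requests, and it applies the textbook knapsack heuristic of picking by gain-to-cost ratio and falling back to the single best affordable query when that is better. `Candidate`, `select_queries`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    gain: float  # estimated data-quality improvement (assumed additive)
    cost: float  # estimated number of top-k requests needed


def select_queries(candidates, budget):
    """Greedy by gain/cost ratio, compared against the best single affordable query."""
    affordable = [c for c in candidates if 0 < c.cost <= budget]
    chosen, spent = [], 0.0
    for c in sorted(affordable, key=lambda c: c.gain / c.cost, reverse=True):
        if spent + c.cost <= budget:
            chosen.append(c)
            spent += c.cost
    best_single = max(affordable, key=lambda c: c.gain, default=None)
    if best_single is not None and best_single.gain > sum(c.gain for c in chosen):
        return [best_single]
    return chosen


# Invented estimates, for illustration only.
candidates = [
    Candidate("A", gain=6, cost=2),
    Candidate("B", gain=5, cost=5),
    Candidate("C", gain=4, cost=2),
    Candidate("D", gain=20, cost=10),
]
small = [c.name for c in select_queries(candidates, budget=5)]
large = [c.name for c in select_queries(candidates, budget=10)]
print(small, large)
```

With a budget of 5 the ratio-ordered pass keeps A and C; with a budget of 10 the single expensive query D beats the greedy set, which is why the fallback comparison is included.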

Key words: Top-k Query, Web Database Crawling, Data Quality, Query Cost, Query Selection

Received: 2016-09-10

CLC number (ZTFLH): TP 311

Funding: Supported by the National Natural Science Foundation of China (No. 61232002, 61202035) and the Wuhan Innovation Team Program (No. 2014070504020237)

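About the authors: JIANG Junyan, male, born in 1987, Ph.D. candidate. His main research interest is web data management. E-mail: jiangjy@whu.edu.cn. PENG Zhiyong, male, born in 1963, Ph.D., professor. His main research interests include complex data management, trusted data management, and web data management. E-mail: peng@whu.edu.cn. WU Xiaoying (corresponding author), female, born in 1973, Ph.D., associate professor. Her main research interests include data management, query processing and optimization, keyword search, pattern mining, the Semantic Web, and data integration. E-mail: xiaoying.wu@whu.edu.cn.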

Cite this article:

JIANG Junyan, PENG Zhiyong, WU Xiaoying. Incremental Deep Web Crawling with Top-k Query Constraint[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2017, 30(1): 43-53.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201701005 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2017/V30/I1/43
