MapReduce框架下的不确定数据Top-k查询计算

摘要
图/表
参考文献
相关文章 (5)

全文: PDF (698 KB) HTML (0 KB)
输出: BibTeX | EndNote (RIS)

摘要 Top-k查询是不确定性数据管理中普遍采用的一种技术。基于参数化排名函数的Top-k查询语义是近年来提出的各种查询语义的统一。文中针对海量不确定数据，提出一种基于MapReduce框架的Top-k计算的有效方法。通过分析基于参数化排名函数的不确定数据Top-k查询语义，设计一种获得未计算元组的排名函数值上界的算法，避免计算所有元组的排名函数值，解决Top-k计算中的剪枝问题。在MapReduce计算模型中提出两种不同的策略来实现该算法。文中针对单机环境和Hadoop分布式计算平台进行两组不同的对比实验。实验表明在处理海量不确定数据时，该算法在计算时间上有较高的性能提升。

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	卢鑫
	陈华辉
	董一鸿
	钱江波

关键词 ：不确定数据, Top-k查询, MapReduce

Abstract：Top-k query is commonly used in the management and application on uncertain data. And the Top-k query semantics base on parameterized ranking functions (PRF) is the unified approach of various query semantics proposed in recent years. Aiming at the massive uncertain dataset,an effective method for the Top-k query based on MapReduce is proposed. Through the analysis on the Top-k query semantics of parameterized ranking functions,an algorithm is presented to get the upper bound of an un-retrieved tuple. In this way,the pruning strategy is used to get the Top-k tuples without retrieving every tuple in the dataset. Furthermore,two different strategies are presented to implement the proposed algorithm under the MapReduce computing model in Hadoop. Finally,two groups of experiments are performed aiming at a single-machine environment and the Hadoop distributed computing platform. The experimental results show that the proposed algorithm is more effective to deal with the Top-k queries for the massive uncertain data on running time.

Key words： Uncertain Data Top-k Query MapReduce

收稿日期: 2012-10-12

ZTFLH:

TP311.13

基金资助:国家自然科学基金项目(No.60973047)、浙江省自然科学基金项目(No.Y1091189)、浙江省公益性技术应用研究计划项目(No.2011C21076)、宁波市自然科学基金项目(No.2009A610072)、宁波大学胡岚博士基金项目(No.2011277)资助

作者简介: 卢鑫(通讯作者)，男，1987年生，硕士研究生，主要研究方向为数据挖掘、云计算.E-mail:koberichard@126.com.陈华辉，男，1964年生，博士，副教授，主要研究方向为数据库、数据挖掘.董一鸿，男，1970年生，博士，教授，主要研究方向为移动数据库、数据挖掘、人工智能、软计算.钱江波，男，1974年生，博士，副教授，主要研究方向为数据库、数据流.

引用本文:

卢鑫，陈华辉，董一鸿，钱江波. MapReduce框架下的不确定数据Top-k查询计算[J]. 模式识别与人工智能, 2013, 26(7): 695-700. LU Xin,CHEN Hua-Hui,DONG Yi-Hong,QIAN Jiang-Bo. Top-k Query Calculations on Uncertain Dataset under MapReduce Framework. , 2013, 26(7): 695-700.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/ 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2013/V26/I7/695

[1] Bohn R E,Short J E. How Much Information? [DB/OL]. [2009-09-01]. http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf
[2] Soliman M A,Ilyas I F,Chang K C C. Top-k Query Processing in Uncertain Databases // Proc of the 23rd IEEE International Conference on Data Engineering. Istanbul,Turkey,2007: 896-905
[3] Soliman M A,Ilyas I F. Ranking with Uncertain Scores // Proc of the 25th IEEE International Conference on Data Engineering. Shanghai,China,2009: 317-328
[4] Lian Xiang,Chen Lei. Probabilistic Ranked Queries in Uncertain Databases // Proc of the 11th International Conference on Extending Database Technology. Nantes,France,2008: 511-522
[5] Hua Ming,Jian Pei,Zhang Weijie,et al. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach // Proc of the ACM SIGMOD International Conference on Management of Data. Vancouver,Canada,2008: 673-686
[6] Hua Ming,Dei Jian,Zhang Wenjie,et al. Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data // Proc of the 24th IEEE International Conference on Data Engineering. Cancun,Mexico,2008: 1403-1405
[7] Hua Ming,Pei Jian,Liu Xuemin. Ranking Queries on Uncertain Data. The International Journal on Very Large Data Bases,2011,20(1): 129-153
[8] Zhang Xi,Chomicki J. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases // Proc of the 24th IEEE International Conference on Data Engineering. Cancun,Mexico,2008: 556-563
[9] Cormode G,Li Feifei,Yi Ke. Semantics of Ranking Queries for Probabilistic Data and Expected Ranks // Proc of the 25th IEEE International Conference on Data Engineering. Shanghai,China,2009: 305-316
[10] Jestes J,Cormode G,Li Feifei,et al. Semantics of Ranking Queries for Probabilistic Data. IEEE Trans on Knowledge and Data Engineering,2011,23(12): 1903-1917
[11] Ge Tingjian,Zdonik S,Madden S. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers // Proc of the ACM SIGMOD International Conference on Management of Data. Providence,USA,2009: 375-388
[12] Li Jian,Saha B,Deshpande A. An Unified Approach to Ranking in Probabilistic Databases. The VLDB Journal,2011,20(2): 249-275
[13] Li Jian,Deshpande A. Ranking Continuous Probabilistic Datasets. Proceedings of the VLDB Endowment,2010,3(1/2): 638-649
[14] Wang Chonghai,Yuan Liyan,You Jiahuai,et al. On Pruning for Top-k Ranking in Uncertain Databases. Proceedings of the VLDB Endowment,2011,4(10): 598-609
[15] Dean J,Ghemawat S. MapReduce: Simplified Data Processing on Large Cluster. Communications of the ACM,2008,51(1): 107-113
[16] Pei Jian,Jiang Bin,Liu Xuemin,et al. Probabilistic Skylines on Uncertain Data // Proc of the 33rd International Conference on Very Large Data Bases,Vienna,Austria,2007: 15-26
[17] Dittrich J,Quiané-Ruiz J A,Jindal A,et al. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Proceedings of the VLDB Endowment,2010,3(1/2): 515-529
[18] Ding Linlin,Xin Junchang,Wang Guoren,et al. Efficient Skyline Query Processing of Massive Data Based on Map-Reduce. Chinese Journal of Computers,2011,34(10): 1785-1796 (in Chinese)
(丁琳琳,信俊昌,王国仁,等.基于Map-Reduce的海量数据高效Skyline查询处理.计算机学报,2011,34(10): 1785-1796)
[19] Li Lingjuan,Zhang Min. Research on Algorithms of Mining Association Rule under Cloud Computing Environment. Computer Technology and Development,2011,21(2): 43-46 (in Chinese)
(李玲娟,张敏.云计算环境下关联规则挖掘算法的研究.计算机技术与发展,2011,21(2): 43-46)
[20] Jeffrey J,Yi Ke,Li Feifei. Building Wavelet Histograms on Large Data in MapReduce. Proceedings of the VLDB Endowment,2011,5(2): 109-120