Abstract:Top-k query is commonly used in the management and application on uncertain data. And the Top-k query semantics base on parameterized ranking functions (PRF) is the unified approach of various query semantics proposed in recent years. Aiming at the massive uncertain dataset,an effective method for the Top-k query based on MapReduce is proposed. Through the analysis on the Top-k query semantics of parameterized ranking functions,an algorithm is presented to get the upper bound of an un-retrieved tuple. In this way,the pruning strategy is used to get the Top-k tuples without retrieving every tuple in the dataset. Furthermore,two different strategies are presented to implement the proposed algorithm under the MapReduce computing model in Hadoop. Finally,two groups of experiments are performed aiming at a single-machine environment and the Hadoop distributed computing platform. The experimental results show that the proposed algorithm is more effective to deal with the Top-k queries for the massive uncertain data on running time.
[1] Bohn R E,Short J E. How Much Information? [DB/OL]. [2009-09-01]. http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf [2] Soliman M A,Ilyas I F,Chang K C C. Top-k Query Processing in Uncertain Databases // Proc of the 23rd IEEE International Conference on Data Engineering. Istanbul,Turkey,2007: 896-905 [3] Soliman M A,Ilyas I F. Ranking with Uncertain Scores // Proc of the 25th IEEE International Conference on Data Engineering. Shanghai,China,2009: 317-328 [4] Lian Xiang,Chen Lei. Probabilistic Ranked Queries in Uncertain Databases // Proc of the 11th International Conference on Extending Database Technology. Nantes,France,2008: 511-522 [5] Hua Ming,Jian Pei,Zhang Weijie,et al. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach // Proc of the ACM SIGMOD International Conference on Management of Data. Vancouver,Canada,2008: 673-686 [6] Hua Ming,Dei Jian,Zhang Wenjie,et al. Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data // Proc of the 24th IEEE International Conference on Data Engineering. Cancun,Mexico,2008: 1403-1405 [7] Hua Ming,Pei Jian,Liu Xuemin. Ranking Queries on Uncertain Data. The International Journal on Very Large Data Bases,2011,20(1): 129-153 [8] Zhang Xi,Chomicki J. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases // Proc of the 24th IEEE International Conference on Data Engineering. Cancun,Mexico,2008: 556-563 [9] Cormode G,Li Feifei,Yi Ke. Semantics of Ranking Queries for Probabilistic Data and Expected Ranks // Proc of the 25th IEEE International Conference on Data Engineering. Shanghai,China,2009: 305-316 [10] Jestes J,Cormode G,Li Feifei,et al. Semantics of Ranking Queries for Probabilistic Data. IEEE Trans on Knowledge and Data Engineering,2011,23(12): 1903-1917 [11] Ge Tingjian,Zdonik S,Madden S. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers // Proc of the ACM SIGMOD International Conference on Management of Data. Providence,USA,2009: 375-388 [12] Li Jian,Saha B,Deshpande A. An Unified Approach to Ranking in Probabilistic Databases. The VLDB Journal,2011,20(2): 249-275 [13] Li Jian,Deshpande A. Ranking Continuous Probabilistic Datasets. Proceedings of the VLDB Endowment,2010,3(1/2): 638-649 [14] Wang Chonghai,Yuan Liyan,You Jiahuai,et al. On Pruning for Top-k Ranking in Uncertain Databases. Proceedings of the VLDB Endowment,2011,4(10): 598-609 [15] Dean J,Ghemawat S. MapReduce: Simplified Data Processing on Large Cluster. Communications of the ACM,2008,51(1): 107-113 [16] Pei Jian,Jiang Bin,Liu Xuemin,et al. Probabilistic Skylines on Uncertain Data // Proc of the 33rd International Conference on Very Large Data Bases,Vienna,Austria,2007: 15-26 [17] Dittrich J,Quiané-Ruiz J A,Jindal A,et al. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Proceedings of the VLDB Endowment,2010,3(1/2): 515-529 [18] Ding Linlin,Xin Junchang,Wang Guoren,et al. Efficient Skyline Query Processing of Massive Data Based on Map-Reduce. Chinese Journal of Computers,2011,34(10): 1785-1796 (in Chinese) (丁琳琳,信俊昌,王国仁,等.基于Map-Reduce的海量数据高效Skyline查询处理.计算机学报,2011,34(10): 1785-1796) [19] Li Lingjuan,Zhang Min. Research on Algorithms of Mining Association Rule under Cloud Computing Environment. Computer Technology and Development,2011,21(2): 43-46 (in Chinese) (李玲娟,张 敏.云计算环境下关联规则挖掘算法的研究.计算机技术与发展,2011,21(2): 43-46) [20] Jeffrey J,Yi Ke,Li Feifei. Building Wavelet Histograms on Large Data in MapReduce. Proceedings of the VLDB Endowment,2011,5(2): 109-120