Abstract: Processing huge data sets is an important topic in data mining. Although many serial and parallel algorithms have been developed for huge data sets, most of them do not resolve the conflict between speed and accuracy well. In this paper, the huge data set is partitioned into many small subsets to take advantage of distributed computing. First, a definition of the best partition is proposed. Then, a rough-set-based partition algorithm is developed to search for the best partition. Experimental results show that the distributed information processing method based on this partition algorithm is effective for huge data sets: it is faster than the original rough-set-based algorithms, and its performance is as good as that of methods that process the original data set as a whole.
覃政仁, 吴渝, 王国胤. 一种基于RoughSet的海量数据分割算法[J]. 模式识别与人工智能, 2006, 19(2): 249-256.
QIN ZhengRen, WU Yu, WANG GuoYin. A Partition Algorithm for Huge Data Sets Based on Rough Set. Pattern Recognition and Artificial Intelligence, 2006, 19(2): 249-256.