Clustering Ensembles Based Classification Method for Imbalanced Data Sets
CHEN Si, GUO Gong-De, CHEN Li-Fei
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007
Key Laboratory of Network Security and Cryptography, Fujian Normal University, Fuzhou 350007
Abstract: Recently, the classification of imbalanced data sets has become a research hotspot in data mining and machine learning. A class of novel classification methods for imbalanced data sets based on clustering ensembles is proposed, which aims to provide a better training platform for classifiers by introducing a clustering consistency index to identify minority-class examples near cluster boundaries and majority-class examples near cluster centers. An improved synthetic minority over-sampling technique (SMOTE) and a modified random under-sampling method are then applied, respectively, to rebalance the data sets. The classification performance of eight methods is compared on several public data sets. Experimental results show that the proposed methods perform better on both the minority and majority classes, and that they are effective and feasible for handling imbalanced data sets.
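To make the described pipeline concrete, the sketch below gives one possible reading of the approach in Python; it is an illustration only, not the authors' implementation. The consistency index is assumed here to be the average agreement between each example and its nearest neighbours across an ensemble of k-means partitions, the over-sampling step is plain SMOTE-style interpolation restricted to low-consistency (boundary) minority examples, and the under-sampling step assumes that high-consistency (cluster-centre) majority examples are removed first. All function names, thresholds, and balancing targets are hypothetical.

```python
# Illustrative sketch only (not the paper's implementation). Assumes binary
# labels and a simple consistency index: the average agreement between each
# example and its nearest neighbours over an ensemble of k-means partitions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def consistency_index(X, n_partitions=10, k_range=(2, 6), n_neighbors=5, seed=0):
    """Fraction of ensemble partitions in which an example shares a cluster
    with its nearest neighbours; low values suggest a cluster boundary."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)                      # idx[:, 0] is the point itself
    agree = np.zeros(X.shape[0])
    for _ in range(n_partitions):
        k = rng.randint(k_range[0], k_range[1] + 1)
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=rng.randint(1 << 30)).fit_predict(X)
        agree += (labels[idx[:, 1:]] == labels[:, None]).mean(axis=1)
    return agree / n_partitions


def smote_like(X_seed, n_new, n_neighbors=5, seed=0):
    """SMOTE-style interpolation between seed examples and their neighbours."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=min(n_neighbors + 1, len(X_seed))).fit(X_seed)
    _, idx = nn.kneighbors(X_seed)
    base = rng.randint(0, len(X_seed), n_new)
    mate = idx[base][np.arange(n_new), rng.randint(1, idx.shape[1], n_new)]
    lam = rng.rand(n_new, 1)
    return X_seed[base] + lam * (X_seed[mate] - X_seed[base])


def rebalance(X, y, minority_label, boundary_quantile=0.3, seed=0):
    """Hypothetical rebalancing: oversample boundary minority examples,
    then drop the most 'central' majority examples until classes match."""
    ci = consistency_index(X, seed=seed)
    min_mask = (y == minority_label)
    X_min, X_maj = X[min_mask], X[~min_mask]
    ci_min, ci_maj = ci[min_mask], ci[~min_mask]
    maj_label = y[~min_mask][0]

    # Minority examples with low consistency lie near cluster boundaries.
    boundary = X_min[ci_min <= np.quantile(ci_min, boundary_quantile)]
    n_new = max(len(X_maj) // 2 - len(X_min), 0)   # meet roughly halfway
    X_syn = (smote_like(boundary, n_new, seed=seed)
             if n_new > 0 and len(boundary) > 1 else np.empty((0, X.shape[1])))

    # Majority examples with high consistency lie near cluster centres and
    # are removed first here (an assumed reading of the under-sampling step).
    keep = np.argsort(ci_maj)[: len(X_min) + len(X_syn)]
    X_bal = np.vstack([X_min, X_syn, X_maj[keep]])
    y_bal = np.concatenate([np.full(len(X_min) + len(X_syn), minority_label),
                            np.full(len(keep), maj_label)])
    return X_bal, y_bal
```

The rebalanced training set returned by rebalance can then be passed to any standard classifier; the choice of balancing target and quantile threshold above is arbitrary and would need tuning in practice.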