1. Key Laboratory of Computer Network and Information Integration, Ministry of Education, School of Computer Science and Engineering, Southeast University, Nanjing 211189
2. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 211189
Abstract: Stochastic under-sampling ignores potentially useful information in the majority class, and this problem becomes even more severe when under-sampling is applied to multi-class imbalance problems. In this paper, EasyEnsemble.M is proposed for the multi-class imbalance problem. The potentially useful information in the majority classes, which under-sampling would otherwise discard, is explored by stochastically sampling the majority classes multiple times. Sub-classifiers are then learned from these samples, and a strong classifier is obtained with hybrid ensemble techniques. Experimental results show that EasyEnsemble.M is superior to other frequently used multi-class imbalance learning methods when G-mean is used as the performance measure.
[1] Ye Z F, Wen Y M, Lü B L. A Survey of Imbalanced Pattern Classification Problems. CAAI Trans on Intelligent Systems, 2009, 4(2): 148-156 (in Chinese)
[2] Dong Y J. Random-SMOTE Method for Imbalanced Data Sets. Master Dissertation. Dalian, China: Dalian University of Technology, 2009 (in Chinese)
[3] Chawla N V. Data Mining for Imbalanced Datasets: An Overview [EB/OL]. [2013-03-10]. http://link.springer.com/chapter/10.1007%2F0-387-25465-X_40#page-1
[4] Davenport M. Introduction to Modern Information Retrieval. Journal of the Medical Library Association, 2012. DOI: 10.3163/1536-5050.100
[5] Bradley A P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 1997, 30(7): 1145-1159
[6] Drummond C, Holte R C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling [EB/OL]. [2013-03-10]. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.6858&repl&type=pdf
[7] Wang S, Yao X. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans on Systems, Man and Cybernetics, 2012, 42(4): 1119-1130
[8] Zhao X M, Li X, Chen L N, et al. Protein Classification with Imbalanced Data. Proteins: Structure, Function and Bioinformatics, 2008, 70(4): 1125-1132
[9] Chen K, Lü B L, Kwok J T. Efficient Classification of Multi-label and Imbalanced Data Using Min-Max Modular Classifiers // Proc of the International Joint Conference on Neural Networks. Vancouver, Canada, 2006: 1170-1775
[10] Tan A C, Gilbert D, Deville Y. Multi-class Protein Fold Classification Using a New Ensemble Machine Learning Approach. Genome Informatics, 2003, 14: 206-217
[11] Liao T W. Classification of Weld Flaws with Imbalanced Class Data. Expert Systems with Applications, 2008, 35(3): 1041-1052
[12] Zhou Z H, Liu X Y. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Trans on Knowledge and Data Engineering, 2006, 18(1): 63-77
[13] Rifkin R, Klautau A. In Defense of One-vs-All Classification. The Journal of Machine Learning Research, 2004, 5: 101-141
[14] Hastie T, Tibshirani R. Classification by Pairwise Coupling. The Annals of Statistics, 1998, 26(2): 451-471
[15] Alejo R, Sotoca J, Valdovinos R, et al. The Multi-class Imbalance Problem: Cost Functions with Modular and Non-Modular Neural Networks // Proc of the 6th International Symposium on Neural Networks. Berlin, Germany: Springer, 2009: 421-431
[16] Sun Y, Kamel M S, Wang Y. Boosting for Learning Multiple Classes with Imbalanced Class Distribution // Proc of the 6th IEEE International Conference on Data Mining. Hong Kong, China, 2006: 592-602
[17] Freund Y, Schapire R. A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55(1): 119-139
[18] Wang S, Chen H H, Yao X. Negative Correlation Learning for Classification Ensembles // Proc of the International Joint Conference on Neural Networks. Barcelona, Spain, 2010: 1-8
[19] Hoens T, Qian Q, Chawla N, et al. Building Decision Trees for the Multi-class Imbalance Problem. Lecture Notes in Computer Science, 2012. DOI: 10.1007/978-3-642-30217-6_11
[20] Cieslak D A, Chawla N V. Learning Decision Trees for Unbalanced Data // Proc of the European Conference on Machine Learning and Knowledge Discovery in Databases. Antwerp, Belgium, 2008, I: 241-256
[21] Liu X Y, Wu J X, Zhou Z H. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans on Systems, Man and Cybernetics, 2009, 39(2): 539-550
[22] Wang S, Yao X. Theoretical Study of the Relationship between Diversity and Single-Class Measures for Class Imbalance Learning // Proc of the IEEE International Conference on Data Mining. Miami, USA, 2009: 76-81
[23] Breiman L, Friedman J, Stone C J, et al.
Classification and Regression Trees. London, UK: Chapman & Hall, 1984
[24] Hand D J, Till R J. A Simple Generalisation of the Area under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45(2): 171-186

The difficulty of multi-class imbalance problems lies in the diversity of the sample distributions across classes and the increased concept complexity: each class accounts for only a small fraction of the whole training set, so even when the classes are balanced, the correct concept of a class is hard to learn successfully.

Random under-sampling is widely applied to two-class imbalance problems. Its advantage is that it is simple and efficient; its drawback is that it discards much potentially useful information in the majority-class samples. Studies have shown [7] that random under-sampling is comparable to, or even better than, random over-sampling. For multi-class imbalance problems, however, especially when the minority classes contain very few samples, the information loss of under-sampling causes more severe problems, and applying under-sampling directly can hardly achieve good classification performance [10].
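The under-sampling-plus-ensemble idea discussed above can be sketched in a few lines. The code below is a minimal illustrative sketch, not the authors' EasyEnsemble.M implementation: it draws several balanced subsets by randomly under-sampling every majority class down to the smallest class size, trains one sub-classifier per subset (a toy nearest-centroid learner stands in for the base learner), and combines the sub-classifiers by majority vote. All class and function names here are hypothetical.

```python
import numpy as np


def undersample_majority(X, y, rng):
    """Draw one balanced subset: every class is randomly under-sampled
    down to the size of the smallest class. Repeated draws explore
    different majority-class samples, which is the key idea above."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]


class CentroidClassifier:
    """Toy base learner: predicts the class of the nearest class centroid."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from each sample to each centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]


class EasyEnsembleSketch:
    """Train one sub-classifier per balanced subset, then majority-vote."""

    def __init__(self, n_subsets=10, seed=0):
        self.n_subsets = n_subsets
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.learners_ = []
        for _ in range(self.n_subsets):
            Xs, ys = undersample_majority(X, y, self.rng)
            self.learners_.append(CentroidClassifier().fit(Xs, ys))
        return self

    def predict(self, X):
        # votes[i, j] = prediction of sub-classifier i on sample j.
        votes = np.stack([m.predict(X) for m in self.learners_])
        return np.array([np.bincount(col).argmax() for col in votes.T])
```

In the full method described in the abstract, the base learners and their combination are more elaborate (hybrid ensemble techniques rather than a plain vote), but the structure — multiple stochastic samplings of the majority classes, one sub-classifier each, one aggregated strong classifier — is the same.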