EasyEnsemble.M: A Learning Algorithm for Multiclass Class Imbalance*

LI Qian-Qian, LIU Xu-Ying

Pattern Recognition and Artificial Intelligence, 2014, Vol. 27, Issue (2): 187-192.
Research and Applications

EasyEnsemble.M for Multiclass Imbalance Problem

Abstract

Random under-sampling discards potentially useful information in the majority classes, and the loss is more severe in multiclass imbalance problems. This paper proposes EasyEnsemble.M, a learning algorithm for multiclass class imbalance. The algorithm randomly samples the majority classes multiple times, so that majority-class samples ignored by a single random under-sampling are still exploited, learns multiple sub-classifiers, and combines them with hybrid ensemble techniques into a strong classifier. Experimental results show that EasyEnsemble.M achieves a higher G-mean than commonly used multiclass class-imbalance learning methods.
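
G-mean, the performance measure named in the abstract, is commonly defined for multiclass problems as the geometric mean of the per-class recalls. The abstract does not spell out the definition used in the paper, so the following is a minimal sketch under that common definition; the function name `g_mean` is illustrative.

```python
from collections import Counter
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls over the classes present in y_true.

    Assumption: this is the common multiclass definition, not necessarily
    the exact formula used in the paper.
    """
    totals = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    recalls = [hits[c] / totals[c] for c in totals]
    return prod(recalls) ** (1.0 / len(recalls))
```

Under this definition, a classifier that predicts the majority class for every sample in an 8/1/1 class split reaches accuracy 0.8 but a G-mean of 0, because the recall of each minority class is 0. This is why the measure suits imbalanced data.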

Key words

Machine Learning / Class-Imbalance Learning / Under-Sampling / Ensemble

Cite this article

LI Qian-Qian, LIU Xu-Ying. EasyEnsemble.M for Multiclass Imbalance Problem. Pattern Recognition and Artificial Intelligence. 2014, 27(2): 187-192

References

[1] Ye Z F, Wen Y M, Lü B L. A Survey of Imbalanced Pattern Classification Problems. CAAI Trans on Intelligent Systems, 2009, 4(2): 148-156 (in Chinese)
(叶志飞,文益民,吕宝粮.不平衡分类问题研究综述.智能系统学报, 2009, 4(2): 148-156)
[2] Dong Y J. Random-SMOTE Method for Imbalanced Data Sets. Master Dissertation. Dalian, China: Dalian University of Technology, 2009 (in Chinese)
(董燕杰.不平衡数据集分类的Random-SMOTE方法研究.硕士学位论文.大连:大连理工大学, 2009)
[3] Chawla N V. Data Mining for Imbalanced Datasets: An Overview[EB/OL]. [2013-03-10]. http://link.springer.com/chapter/10.1007%2F0-387-25465-X_40#page-1
[4] Davenport M. Introduction to Modern Information Retrieval. Journal of the Medical Library Association, 2012. DOI: 10.3163/1536-5050.100
[5] Bradley A P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 1997, 30(7): 1145-1159
[6] Drummond C, Holte R C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling[EB/OL]. [2013-03-10]. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.6858&repl&type=pdf
[7] Wang S, Yao X. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans on Systems, Man and Cybernetics, 2012, 42(4): 1119-1130
[8] Zhao X M, Li X, Chen L N, et al. Protein Classification with Imbalanced Data. Proteins: Structure, Function and Bioinformatics, 2008, 70(4): 1125-1132
[9] Chen K, Lü B L, Kwok J T. Efficient Classification of Multi-label and Imbalanced Data Using Min-Max Modular Classifiers // Proc of the International Joint Conference on Neural Networks. Vancouver, Canada, 2006: 1770-1775
[10] Tan A C, Gilbert D, Deville Y. Multi-class Protein Fold Classification Using a New Ensemble Machine Learning Approach. Genome Informatics, 2003, 14: 206-217
[11] Liao T W. Classification of Weld Flaws with Imbalanced Class Data. Expert Systems with Applications, 2008, 35(3): 1041-1052
[12] Zhou Z H, Liu X Y. Training Cost-Sensitive Neural Network with Methods Addressing the Class Imbalance Problem. IEEE Trans on Knowledge and Data Engineering, 2006, 18(1): 63-77
[13] Rifkin R, Klautau A. In Defense of One-vs-All Classification. The Journal of Machine Learning Research, 2004, 5: 101-141
[14] Hastie T, Tibshirani R. Classification by Pairwise Coupling. The Annals of Statistics, 1998, 26(2): 451-471
[15] Alejo R, Sotoca J, Valdovinos R, et al. The Multi-class Imbalance Problem: Cost Functions with Modular and Non-Modular Neural Networks // Proc of the 6th International Symposium on Neural Networks. Berlin, Germany: Springer, 2009: 421-431
[16] Sun Y, Kamel M S, Wang Y. Boosting for Learning Multiple Classes with Imbalanced Class Distribution // Proc of the 6th IEEE Industrial Conference on Data Mining. Hong Kong, China, 2006: 592-602
[17] Freund Y, Schapire R. A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997, 55(1): 119-139
[18] Wang S, Chen H H, Yao X. Negative Correlation Learning for Classification Ensembles // Proc of the International Joint Conference on Neural Networks. Barcelona, Spain, 2010: 1-8
[19] Hoens T, Qian Q, Chawla N, et al. Building Decision Trees for the Multi-class Imbalance Problem. Lecture Notes in Computer Science, 2012. DOI:10.1007/978-3-642-30217-6_11
[20] Cieslak D A, Chawla N V. Learning Decision Trees for Unbalanced Data // Proc of the European Conference on Machine Learning and Knowledge Discovery in Databases. Antwerp, Belgium, 2008, I: 241-256
[21] Liu X Y, Wu J X, Zhou Z H. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans on Systems, Man and Cybernetics, 2009, 39(2): 539-550
[22] Wang S, Yao X. Theoretical Study of the Relationship between Diversity and Single-Class Measures for Class Imbalance Learning // Proc of the IEEE International Conference on Data Mining. Miami, USA, 2009: 76-81
[23] Breiman L, Friedman J, Stone C J, et al. Classification and Regression Trees. London, UK: Chapman & Hall, 1984
[24] Hand D J, Till R J. A Simple Generalisation of the Area under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45(2): 171-186
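The difficulty of multiclass class-imbalance problems lies in the more diverse sample distributions across classes and the increased complexity of the concepts to be learned: each class accounts for only a small fraction of the total training set. Even when the classes are balanced, it is hard to learn the correct concept from the samples of such a class.
Random under-sampling is widely used in two-class class-imbalance problems. Its advantage is that it is simple and efficient; its drawback is that it ignores much potentially useful majority-class information. Studies have noted [7] that random under-sampling performs comparably to, or even better than, random over-sampling. In multiclass imbalance problems, especially when the minority classes contain very few samples, this drawback causes more serious problems, and applying under-sampling directly rarely yields good classification performance [10].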
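
The drawback described above, and the repeated-sampling remedy outlined in the abstract, can be illustrated with a small sketch. It is not the paper's algorithm: the nearest-centroid base learner, the majority vote, and the choice to sample every class down to the size of the smallest class are all assumptions made to keep the example self-contained, and every function name is hypothetical. The source only states that the sub-classifiers are combined with hybrid ensemble techniques.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(y, rng):
    """Randomly under-sample every class to the size of the smallest class (assumed scheme)."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, n_min))
    return chosen


def fit_centroids(X, y, indices):
    """Stand-in base learner (assumption): one mean vector per class."""
    sums, counts = {}, Counter()
    for i in indices:
        acc = sums.setdefault(y[i], [0.0] * len(X[i]))
        for d, v in enumerate(X[i]):
            acc[d] += v
        counts[y[i]] += 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_centroids(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))


def fit_ensemble(X, y, n_subsets=10, seed=0):
    """Train one sub-classifier per balanced subset; also report which samples were used."""
    rng = random.Random(seed)
    models, used = [], set()
    for _ in range(n_subsets):
        idx = balanced_subset(y, rng)
        used.update(idx)
        models.append(fit_centroids(X, y, idx))
    return models, used


def predict_ensemble(models, x):
    """Combine sub-classifiers by majority vote (assumed combination rule)."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

With class sizes of, say, 100, 10, and 5, a single call to `balanced_subset` keeps only 15 samples and discards 95 of the 100 largest-class samples. The `used` set returned by `fit_ensemble` shows how drawing several subsets brings more of the discarded samples into training.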

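Funding

Supported by the Young Scientists Fund of the National Natural Science Foundation of China (No.61105046), the Specialized Research Fund for the Doctoral Program of Higher Education of the Ministry of Education (No.20110092120029), and the Open Project of the State Key Laboratory for Novel Software Technology at Nanjing University (No.KFKT2011B01)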