Abstract: On highly imbalanced data, the bootstrap sampling used by the traditional cost-sensitive random forest algorithm leads to insufficient learning of minority-class samples, and the large proportion of majority-class samples easily weakens the algorithm's cost-sensitive mechanism. Therefore, a clustering-based weakly balanced cost-sensitive random forest algorithm is proposed. The majority-class samples are first clustered, and samples are then repeatedly drawn from each cluster under a weak balance criterion. The selected majority-class samples are fused with the minority-class samples of the original training set to generate several new imbalanced datasets for training cost-sensitive decision trees. The proposed algorithm not only allows the minority-class samples to be fully learned, but also ensures that the reduction of majority-class samples leaves the cost-sensitive mechanism largely unaffected. Experiments indicate that the proposed algorithm performs better on highly imbalanced datasets.
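The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the proportional per-cluster draw, and the `ratio` parameter standing in for the weak balance criterion are all assumptions, and cost sensitivity is approximated here with scikit-learn's `class_weight` on each decision tree.

```python
# Hypothetical sketch of a clustering-based weakly balanced
# cost-sensitive random forest (names and parameters assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def train_wbcs_forest(X_maj, X_min, n_trees=10, n_clusters=5, ratio=3.0, seed=0):
    """ratio: majority-to-minority size after reduction, standing in for
    the paper's weak balance criterion (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster the majority class so every region is represented.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    clusters = [X_maj[km.labels_ == c] for c in range(n_clusters)]
    target = int(ratio * len(X_min))  # majority samples kept per tree
    trees = []
    for _ in range(n_trees):
        # Step 2: repeatedly draw from each cluster, proportional to its size.
        parts = []
        for c in clusters:
            k = max(1, int(round(target * len(c) / len(X_maj))))
            idx = rng.choice(len(c), size=min(k, len(c)), replace=False)
            parts.append(c[idx])
        # Step 3: fuse the reduced majority samples with all minority samples.
        n_maj = sum(len(p) for p in parts)
        X_sub = np.vstack(parts + [X_min])
        y_sub = np.hstack([np.zeros(n_maj), np.ones(len(X_min))]).astype(int)
        # Step 4: cost sensitivity via a higher misclassification weight
        # on the minority class (class 1).
        t = DecisionTreeClassifier(class_weight={0: 1.0, 1: ratio},
                                   random_state=seed)
        trees.append(t.fit(X_sub, y_sub))
    return trees

def predict(trees, X):
    # Majority vote over the ensemble of cost-sensitive trees.
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

Because each tree sees every cluster of the majority class plus the full minority set, the minority class is learned by every base classifier while the class ratio stays mildly imbalanced, so the per-tree cost weights still have an effect.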