Adaptive Undersampling Based on Density Peak Clustering
CUI Caixia1,2, CAO Fuyuan1,3, LIANG Jiye1,3
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
2. Computer Science and Technology Department, Taiyuan Normal University, Jinzhong 030619
3. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
Abstract: Undersampling based on K-means clustering has three limitations: it is only suitable for hyperspherical data, it does not take into account the impact of overlapping regions on classification, and it neglects the density of samples within clusters. To address these limitations, an adaptive undersampling method based on density peak clustering is proposed. Firstly, the majority-class samples in the overlapping region are identified by a nearest neighbor search algorithm and deleted. Secondly, a number of clusters of different shapes, sizes and densities are obtained automatically by an improved density peaks clustering algorithm. Then, undersampling is performed according to sampling weights calculated from the density of the samples in the subclusters, and bagging ensemble classification is conducted on the resulting balanced dataset. Experiments indicate that the proposed method achieves better performance on most datasets.
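The first and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not specify: a majority sample is treated as lying in the overlapping region when more than half of its k nearest neighbors belong to the minority class, density is estimated as the inverse of the mean distance to the k nearest neighbors, and sampling weights are proportional to that density. The density peak clustering and bagging steps are omitted, so density is computed over the whole majority class rather than per subcluster. All function names are hypothetical.

```python
import math
import random


def knn_indices(points, i, k):
    """Return indices of the k nearest neighbours of points[i], excluding i."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return [j for _, j in dists[:k]]


def remove_overlap(X, y, majority=0, k=3):
    """Drop majority samples whose neighbourhood is dominated by the minority.

    Assumption: "dominated" means more than half of the k nearest neighbours
    carry a non-majority label. The abstract does not state the exact rule.
    """
    keep = []
    for i, label in enumerate(y):
        if label == majority:
            nbrs = knn_indices(X, i, k)
            n_minority = sum(1 for j in nbrs if y[j] != majority)
            if nbrs and n_minority * 2 > len(nbrs):
                continue  # sample lies in the overlapping region: delete it
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def local_density(points, k=3):
    """Assumed density estimate: inverse mean distance to the k nearest neighbours."""
    dens = []
    for i in range(len(points)):
        nbrs = knn_indices(points, i, k)
        if not nbrs:  # a single isolated point gets a neutral weight
            dens.append(1.0)
            continue
        mean_d = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
        dens.append(1.0 / (mean_d + 1e-12))
    return dens


def density_weighted_undersample(majority_pts, n_keep, k=3, seed=0):
    """Draw n_keep majority points without replacement, weighted by density."""
    rng = random.Random(seed)
    weights = local_density(majority_pts, k)
    pool = list(range(len(majority_pts)))
    chosen = []
    for _ in range(min(n_keep, len(pool))):
        pos = rng.choices(range(len(pool)), weights=[weights[j] for j in pool])[0]
        chosen.append(pool.pop(pos))
    return [majority_pts[j] for j in chosen]
```

In the full method described by the abstract, the weighted sampling would be applied inside each subcluster found by density peak clustering, and the balanced subsets would then be passed to a bagging ensemble.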