Adaptive Undersampling Based on Density Peak Clustering

doi:10.16451/j.cnki.issn1003-6059.202009005


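Abstract: Undersampling based on K-means clustering has several limitations: it is suited only to hyperspherical data, it ignores the effect of the overlapping region on classification, and it ignores how densely the samples in each cluster are packed. This paper therefore proposes an adaptive undersampling method based on density peak clustering. First, a nearest neighbor search algorithm identifies the majority-class samples in the overlapping region and deletes them. Next, an improved density peak clustering automatically yields multiple subclusters of different shapes, sizes, and densities. Sampling weights are then computed from the density of the samples in each subcluster and used for undersampling, and bagging ensemble classification is performed on the resulting balanced dataset. Experiments show that the proposed method performs better on most datasets.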
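The first step in the abstract, removing majority-class samples from the overlapping region with a nearest neighbor search, might look roughly like the sketch below. The abstract does not give the exact overlap criterion, so the rule used here (a majority sample is dropped when at least `min_minority_neighbors` of its `k` nearest neighbors belong to the minority class) is an assumption, as are the function name, its parameters, and the toy data.

```python
import math


def remove_overlapping_majority(X, y, k=5, majority=0, min_minority_neighbors=1):
    """Drop majority samples whose k nearest neighbours include minority samples.

    Assumption: a majority sample lies in the overlapping region when at least
    `min_minority_neighbors` of its k nearest neighbours (Euclidean distance)
    carry a different label. The paper's actual criterion may differ.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority samples are always kept
            continue
        # Brute-force neighbour search: rank all other points by distance.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        minority_hits = sum(1 for j in others[:k] if y[j] != majority)
        if minority_hits < min_minority_neighbors:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: four majority points (label 0) near the origin, three minority
# points (label 1) near (5, 5), and one majority point inside the minority group.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
y = [0, 0, 0, 0, 1, 1, 1, 0]

X_clean, y_clean = remove_overlapping_majority(X, y, k=3)
print(len(X), "->", len(X_clean))  # the majority point at (5.5, 5.5) is dropped
```

The brute-force search is quadratic in the number of samples; on larger datasets a spatial index would be the usual replacement.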


Keywords: imbalanced data, classification, undersampling, density peak clustering, overlapping region

Abstract: Undersampling based on K-means clustering is suitable only for hyperspherical data, does not account for the impact of overlapping regions on classification, and neglects the density of samples within clusters. Therefore, an adaptive undersampling method based on density peak clustering is proposed. First, majority-class samples in the overlapping region are identified by a nearest neighbor search algorithm and deleted. Second, clusters of different shapes, sizes, and densities are obtained automatically by an improved density peak clustering. Then, undersampling is performed according to sampling weights calculated from the density of the samples in the subclusters, and bagging ensemble classification is conducted on the resulting balanced dataset. Experiments indicate that the proposed method performs better on most datasets.
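To make the clustering step more concrete, the sketch below computes the two per-point quantities of the original density peak clustering of Rodriguez and Laio: a local density rho and the distance delta to the nearest point of higher density. Points where both are large are candidate cluster centers. The paper uses an improved variant whose details are not given in the abstract, so this illustrates only the baseline idea; the Gaussian kernel, the cutoff `d_c`, the function name, and the toy points are assumptions.

```python
import math


def density_peak_scores(points, d_c):
    """Return (rho, delta) for every point.

    rho_i   : Gaussian-kernel local density with cutoff distance d_c.
    delta_i : distance to the nearest point of higher density; a point with
              no denser neighbour gets its largest distance to any point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Two well-separated toy groups; indices 3 and 7 sit in the middle of each.
points = [(0, 0), (0, 1), (1, 0), (0.4, 0.4),
          (10, 10), (10, 11), (11, 10), (10.4, 10.4)]

rho, delta = density_peak_scores(points, d_c=1.0)
# Rank points by rho * delta; the highest-ranked ones are candidate centers.
ranking = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
print(ranking[:2])
```

Because the number of centers is read off the rho and delta values rather than fixed in advance, this family of methods does not need the cluster count that K-means requires.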

Key words: Imbalanced Data, Classification, Undersampling, Density Peak Clustering, Overlapping Region

Received: 2020-06-15

CLC number: TP 391

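Funding: Supported by the National Natural Science Foundation of China (No. 61876103) and the Key Research and Development Program of Shanxi Province (No. 201903D121162)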

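Corresponding author: LIANG Jiye, Ph.D., professor. His research interests include artificial intelligence, granular computing, data mining, and machine learning. E-mail: ljy@sxu.edu.cn.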

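About the authors: CUI Caixia, Ph.D. candidate. Her research interests include data mining and machine learning. E-mail: cuicaixia@tynu.edu.cn. CAO Fuyuan, Ph.D., professor. His research interests include data mining and machine learning. E-mail: cfy@sxu.edu.cn.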

Cite this article:

CUI Caixia, CAO Fuyuan, LIANG Jiye. Adaptive Undersampling Based on Density Peak Clustering. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2020, 33(9): 811-819.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202009005 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2020/V33/I9/811
