Kernel SVM Algorithm Based on Identifying Key Samples for Imbalanced Data

doi:10.16451/j.cnki.issn1003-6059.201906009

Abstract:

Under-sampling is widely used for imbalanced data. However, existing under-sampling methods seldom take the characteristics of the support vector machine (SVM) into account, and sampling in the original space loses some key information about the majority class. To address these problems, a kernel SVM algorithm based on identifying key samples for imbalanced data (IK-KSVM) is proposed. First, the majority class samples are partitioned using an initial hyperplane. Then, kernel heterogeneous nearest neighbor sampling is performed on each partition in the high-dimensional feature space to obtain the key samples of the majority class. Finally, the final kernel SVM classifier is trained on the key samples and the minority class samples. Experiments on several datasets show that IK-KSVM is feasible and effective, and that its advantage is most evident on datasets whose imbalance ratio exceeds 10:1.
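The sampling step in the abstract can be illustrated with a small sketch. It is not the authors' implementation, only one plausible reading under stated assumptions: "heterogeneous nearest neighbor" is taken to mean the majority samples closest to minority samples, distances are measured in the kernel-induced feature space via ||φ(x) − φ(y)||² = k(x, x) − 2k(x, y) + k(y, y), and the partition of the majority class is passed in as precomputed index lists, because the abstract does not say how the initial hyperplane forms the blocks. The function names, the RBF kernel, `gamma`, and `k` are all hypothetical choices made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_distance_sq(x, y, kernel=rbf_kernel):
    """Squared distance ||phi(x) - phi(y)||^2, computed with the kernel trick."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def key_samples_by_block(majority, minority, blocks, k=1, kernel=rbf_kernel):
    """Return sorted indices of the majority samples kept as key samples.

    Assumption (not taken from the paper): inside each block, every
    minority sample selects its k nearest majority samples, with
    distance measured in the kernel feature space.
    """
    keep = set()
    for block in blocks:
        for m in minority:
            ranked = sorted(
                block,
                key=lambda i: kernel_distance_sq(majority[i], m, kernel),
            )
            keep.update(ranked[:k])
    return sorted(keep)


# Toy data: five majority points, one minority point, two hypothetical blocks.
majority = [(0.0, 0.0), (0.2, 0.1), (0.5, 0.9), (2.0, 2.0), (3.0, 3.0)]
minority = [(1.0, 1.0)]
blocks = [[0, 1, 2], [3, 4]]
print(key_samples_by_block(majority, minority, blocks))
```

The retained majority indices, together with all minority samples, would then form the training set for the final kernel SVM. With an RBF kernel the feature-space ranking coincides with the Euclidean ranking, since the distance reduces to 2 − 2k(x, y); passing another kernel function through the `kernel` argument can change which samples are selected.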

Key words: Imbalanced Data, Kernel Support Vector Machine, Partition, Under-Sampling

Received: 2019-03-05

ZTFLH (Chinese Library Classification number): TP 18

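Funding:

Supported by the National Natural Science Foundation of China (No. 61876103), the Key Project of the Key Research and Development Program of Shanxi Province (No. 201603D111014), and the 1331 Engineering Project of Shanxi Province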

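About the authors:

GUO Ting, master's student. Research interests: data mining, machine learning. E-mail: 876067312@qq.com.
WANG Jie, Ph.D. candidate. Research interests: data mining, machine learning. E-mail: 812849431@qq.com.
LIU Quanming, Ph.D., associate professor. Research interests: cloud storage and cloud security, network behavior analysis, data mining. E-mail: liuqm@sxu.edu.cn.
LIANG Jiye (corresponding author), Ph.D., professor. Research interests: granular computing, data mining, machine learning. E-mail: ljy@sxu.edu.cn.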

Cite this article:

GUO Ting, WANG Jie, LIU Quanming, LIANG Jiye. Kernel SVM Algorithm Based on Identifying Key Samples for Imbalanced Data. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2019, 32(6): 569-576.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201906009 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2019/V32/I6/569
