Kernel SVM Algorithm Based on Identifying Key Samples for Imbalanced Data
GUO Ting1, WANG Jie1, LIU Quanming1, LIANG Jiye1,2
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
Under-sampling is often employed to handle imbalanced data. However, existing under-sampling methods seldom take the characteristics of the support vector machine (SVM) into account, and sampling in the original space can lose key information about the majority class. To address these problems, a kernel SVM algorithm based on identifying key samples for imbalanced data (IK-KSVM) is proposed in this paper. First, the majority class is divided into partitions based on an initial hyperplane. Then, kernel heterogeneous nearest neighbor sampling is conducted on each partition to obtain the key samples of the majority class in the high-dimensional space. Finally, the final SVM classifier is trained on the key samples and the minority class samples. Experiments on several datasets show that IK-KSVM is feasible and effective, and that its advantages are evident when the imbalance ratio of the dataset exceeds 10:1.
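The sampling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "kernel heterogeneous nearest neighbor sampling", so the sketch assumes it means keeping, for each minority sample, its nearest majority-class (opposite-class) samples under the kernel-induced distance d²(x, y) = K(x, x) − 2K(x, y) + K(y, y). The RBF kernel, the value of `gamma`, the parameter `k`, the function names, and the toy data are all hypothetical. Partitioning by the initial hyperplane and training the final SVM are omitted.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma=0.5 is an assumed illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_distance_sq(x, y, kernel=rbf_kernel):
    """Squared distance in the kernel-induced feature space."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def select_key_samples(majority, minority, k=1, kernel=rbf_kernel):
    """Return sorted indices of majority samples that are among the k nearest
    opposite-class neighbors of at least one minority sample (assumed reading
    of heterogeneous nearest neighbor sampling)."""
    keep = set()
    for m in minority:
        ranked = sorted(
            range(len(majority)),
            key=lambda i: kernel_distance_sq(majority[i], m, kernel),
        )
        keep.update(ranked[:k])
    return sorted(keep)

# Hypothetical toy data: six majority points, two minority points.
majority = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (3.0, 3.0), (4.0, 4.2), (5.0, 5.0)]
minority = [(1.5, 1.5), (1.2, 1.8)]

key_idx = select_key_samples(majority, minority, k=1)
print(key_idx)
```

In a full pipeline under these assumptions, `select_key_samples` would be called once per partition of the majority class, and the union of the selected samples together with all minority samples would form the training set of the final classifier.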