An Ensemble Classifier Based on Structural Support Vector Machine for Imbalanced Data
YUAN Xing-Mei (1,2), YANG Ming (1), YANG Yang (3)
1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023; 2. Office of Information Construction and Management, Nanjing Institute of Technology, Nanjing 211167; 3. Honor School, Nanjing Normal University, Nanjing 210023
Abstract: To improve the performance of support vector machine (SVM) classifiers on imbalanced data, an ensemble classifier model based on structural SVM is proposed that incorporates a cost-sensitive strategy. In the proposed model, the training data are partitioned into several groups by the Ward hierarchical clustering algorithm to obtain the structural information hidden in the data, and the weight of every sample is initialized using the prior knowledge contained in the clusters. An AdaBoost strategy then adjusts the weight of each sample dynamically, so that the weights of minority class samples are relatively increased. As a result, the cost of misclassifying positive (minority class) samples rises, which improves the classification accuracy on those samples. The experimental results show that the proposed model effectively improves classification performance on imbalanced data.
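The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the cluster labels are taken as given (the Ward clustering step and the SVM base learner are omitted), `init_weights` uses a hypothetical inverse-cluster-size rule, and `boost_update` is a standard AdaBoost reweighting step with a hypothetical extra cost factor `pos_cost` on misclassified positive samples.

```python
import math
from collections import Counter


def init_weights(cluster_ids):
    """Initial sample weights from cluster membership.

    Hypothetical rule (not from the paper): each sample's weight is
    inversely proportional to the size of its cluster, then normalized
    so that all weights sum to 1.
    """
    sizes = Counter(cluster_ids)
    raw = [1.0 / sizes[c] for c in cluster_ids]
    total = sum(raw)
    return [r / total for r in raw]


def boost_update(weights, y_true, y_pred, pos_cost=2.0):
    """One AdaBoost-style reweighting round with labels in {-1, +1}.

    pos_cost is an assumed extra multiplier applied to misclassified
    positive (minority) samples, standing in for the cost-sensitive
    strategy described in the abstract.
    """
    # Weighted error of the current base classifier.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1.0 - err) / err)

    updated = []
    for w, y, p in zip(weights, y_true, y_pred):
        factor = math.exp(-alpha * y * p)  # shrink if correct, grow if wrong
        if y == 1 and p != y:
            factor *= pos_cost  # extra penalty for a missed positive sample
        updated.append(w * factor)

    total = sum(updated)
    return [w / total for w in updated], alpha


# Toy data: 2 positive (minority) and 6 negative (majority) samples,
# with cluster labels assumed to come from a prior clustering step.
y_true = [1, 1, -1, -1, -1, -1, -1, -1]
cluster_ids = [0, 0, 1, 1, 1, 1, 1, 1]
w0 = init_weights(cluster_ids)

# Pretend a base classifier missed one positive and one negative sample.
y_pred = [1, -1, -1, -1, -1, -1, -1, 1]
w1, alpha = boost_update(w0, y_true, y_pred)
```

In this toy run the two minority samples start with larger weights than the majority samples, and the missed positive sample gains more weight than the missed negative one, which is the qualitative behavior the abstract describes.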