A Kernel Fisher Linear Discriminant Analysis Approach Aiming at Imbalanced Data Set
YIN Jun-Mei, YANG Ming, WAN Jian-Wu
School of Computer Science and Technology, Nanjing Normal University, Nanjing 210097; Jiangsu Research Center of Information Security Confidential Engineering, Nanjing 210097
Abstract: In practical applications, many classification problems involve imbalanced data sets, and this imbalance degrades the performance of many classifiers. This paper first introduces the classification mechanism of kernel Fisher linear discriminant analysis (KFDA) and then analyzes why imbalanced data render KFDA ineffective. On this basis, a weighted kernel Fisher linear discriminant analysis (WKFDA) method is proposed. The method balances the contributions of the kernel covariance matrices of the two classes of samples to the kernel within-class scatter matrix, thereby limiting the influence of imbalanced data on classification performance. Experiments on 7 UCI data sets are performed to test the algorithm. The experimental results show that the proposed approach effectively improves the classification performance of the classifier.
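The balancing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weights used in WKFDA, so the code below assumes one plausible choice: each class's kernel scatter is divided by its sample count, so that both classes contribute a covariance of comparable scale to the kernel within-class scatter matrix. The function names (`rbf_kernel`, `fit_kfda`, `predict`), the RBF kernel, the ridge term `reg`, and the midpoint threshold are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_kfda(X, y, gamma=1.0, reg=1e-3, weighted=False):
    """Two-class kernel Fisher discriminant; labels in y must be 0 or 1.

    weighted=False: plain KFDA, class scatters are simply summed, so the
                    majority class dominates the within-class scatter.
    weighted=True:  assumed weighting (not necessarily the paper's): each
                    class scatter is divided by its class size.
    """
    K = rbf_kernel(X, X, gamma)            # n x n kernel matrix
    n = len(y)
    N = np.zeros((n, n))                   # kernel within-class scatter
    means = []
    for c in (0, 1):
        Kc = K[:, y == c]                  # n x l_c block for class c
        lc = Kc.shape[1]
        means.append(Kc.mean(axis=1))      # kernel class mean
        center = np.eye(lc) - np.full((lc, lc), 1.0 / lc)
        scatter = Kc @ center @ Kc.T       # class scatter in kernel form
        N += scatter / lc if weighted else scatter
    N += reg * np.eye(n)                   # ridge term keeps N invertible
    alpha = np.linalg.solve(N, means[1] - means[0])
    # Assumed threshold: midpoint between the projected class means.
    b = -0.5 * alpha @ (means[0] + means[1])
    return alpha, b

def predict(X_train, alpha, b, X_new, gamma=1.0):
    """Project new samples onto the discriminant and threshold at zero."""
    scores = rbf_kernel(X_new, X_train, gamma) @ alpha + b
    return (scores > 0).astype(int)
```

Calling `fit_kfda(X, y, weighted=True)` and `fit_kfda(X, y, weighted=False)` on the same imbalanced data gives two discriminant directions that can be compared on minority-class recall; whether the weighted variant helps on a given data set is an empirical question that this sketch does not settle.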