In real-world similarity learning problems, the paired samples are imbalanced: the similarity set contains far fewer paired samples than the dissimilarity set. To address this imbalance, two approaches for constructing paired samples are proposed: dissimilar K nearest neighbor and similar K nearest neighbor (DKNN-SKNN), and dissimilar K nearest neighbor and similar K farthest neighbor (DKNN-SKFN). Both approaches effectively reduce the number of paired samples used in similarity learning, which accelerates SVM training and alleviates the imbalanced data problem to some degree. In the experiments, the proposed approaches are compared with several standard resampling methods, and the results show that the proposed approaches perform better.
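The abstract names the two constructions but does not give their details, so the following is only a minimal sketch of one plausible reading. It assumes that each sample is paired with its K nearest neighbors from other classes to form dissimilar pairs, and with its K nearest (SKNN) or K farthest (SKFN) neighbors from its own class to form similar pairs, using Euclidean distance. The function name `build_pairs` and the parameters `k` and `similar` are hypothetical and do not come from the paper.

```python
import math

def build_pairs(X, y, k=1, similar="nearest"):
    """Sketch of DKNN-SKNN / DKNN-SKFN pair construction (assumed reading).

    X: list of feature vectors, y: list of class labels.
    similar="nearest" gives the SKNN variant, "farthest" the SKFN variant.
    Returns (similar_pairs, dissimilar_pairs) as sorted lists of index pairs.
    """
    n = len(X)
    sim, dis = set(), set()
    for i in range(n):
        # All other samples, ordered by Euclidean distance to sample i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        same = [j for j in others if y[j] == y[i]]
        diff = [j for j in others if y[j] != y[i]]
        if similar == "farthest":
            same = same[::-1]  # farthest same-class samples first
        # Dissimilar pairs: K nearest neighbors from other classes.
        for j in diff[:k]:
            dis.add((min(i, j), max(i, j)))
        # Similar pairs: K nearest (or farthest) same-class neighbors.
        for j in same[:k]:
            sim.add((min(i, j), max(i, j)))
    return sorted(sim), sorted(dis)

# Toy data (made up for illustration): two classes on a line.
X = [[0.0], [1.0], [10.0], [12.0]]
y = [0, 0, 1, 1]
sim_pairs, dis_pairs = build_pairs(X, y, k=1)
print(sim_pairs)  # [(0, 1), (2, 3)]
print(dis_pairs)  # [(0, 2), (1, 2), (1, 3)]
```

Under this reading, each sample contributes at most K similar and K dissimilar pairs, so the total number of pairs grows linearly with the number of samples rather than quadratically, which is consistent with the reduction in paired samples claimed above.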