Document Sampling Based on Feature Selection and Condensing Techniques
HAO Xiu-Lan1,2, TAO Xiao-Peng1, WANG Shu-Yun1, XU He-Xiang1,3, HU Yun-Fa1
1. School of Computer Science and Technology, Fudan University, Shanghai 200433
2. School of Information Engineering, Huzhou Teachers College, Huzhou 313000
3. Shanghai Tele-Education Group, Shanghai 200092
Abstract As an instance-based classifier, kNN imposes heavy computational and storage requirements. Moreover, the imbalanced distribution of the training data degrades the performance of the kNN classifier. To address these defects, a technique combining feature selection and condensing is proposed to reduce the time and space costs of the kNN classifier. The proposed algorithm consists of two steps. First, several traditional feature selection methods are combined to form a feature set for each class. Then, redundant cases are removed by combining the class features contained in the samples with the condensing algorithm. Experimental results indicate that when the sample set obtained by the proposed method is used as the training set, the classifier saves time and space dramatically, and its performance improves because noisy data are removed from the training set.
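The two-step procedure described above can be illustrated with a small sketch: class-wise chi-square scoring forms a feature set for each class, and Hart's condensed nearest neighbor rule then discards training documents that the retained prototypes already classify correctly. The Python snippet below is a minimal sketch under these assumptions; the function names, the choice of chi-square as the feature selection criterion, and the toy data are illustrative and not taken from the paper.

```python
# Minimal sketch of the two-step sampling idea (NumPy only).
# chi2_per_class, condense, and the toy data are illustrative assumptions.
import numpy as np

def chi2_per_class(X, y, k):
    """Step 1: score each term against each class with chi-square and
    keep the top-k terms per class as that class's feature set."""
    classes = np.unique(y)
    n = len(y)
    Xb = (X > 0).astype(float)            # presence/absence of each term
    selected = {}
    for c in classes:
        in_c = (y == c).astype(float)
        A = Xb.T @ in_c                   # term present, doc in class c
        B = Xb.T @ (1 - in_c)             # term present, doc not in c
        C = in_c.sum() - A                # term absent, doc in c
        D = (1 - in_c).sum() - B          # term absent, doc not in c
        chi2 = n * (A * D - B * C) ** 2 / (
            (A + B) * (C + D) * (A + C) * (B + D) + 1e-12)
        selected[c] = np.argsort(chi2)[::-1][:k]
    return selected

def condense(X, y, k=1):
    """Step 2: Hart's condensed nearest neighbor rule; keep a sample only
    if the prototypes retained so far misclassify it."""
    keep = [0]                            # seed with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = np.array(keep)[np.argsort(d)[:k]]
            pred = np.bincount(y[nearest]).argmax()
            if pred != y[i]:
                keep.append(i)
                changed = True
    return np.array(sorted(keep))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy term-frequency matrix: 200 documents, 50 terms, 2 classes
    y = rng.integers(0, 2, size=200)
    X = rng.poisson(1.0, size=(200, 50)).astype(float)
    X[y == 1, :10] += rng.poisson(2.0, size=(np.sum(y == 1), 10))

    feats = chi2_per_class(X, y, k=10)
    union = np.unique(np.concatenate(list(feats.values())))
    kept = condense(X[:, union], y)
    print(f"kept {len(kept)} of {len(y)} training documents")
```

In this sketch the condensing step operates on the union of the per-class feature sets, so samples that add no discriminative information under those features are dropped, which mirrors the reduction in time and space reported for the proposed method.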
|
Received: 18 December 2007
|
|
|
|
|
|
|
|