基于特征选择及Condensing技术的文本取样<sup>*</sup>

摘要
图/表
参考文献
相关文章 (15)

全文: PDF (612 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要作为一种基于实例的方法，k-近邻(kNN)分类器有大量的计算及存储需求.同时，训练数据分布的不均衡，也会导致kNN分类器的性能下降.针对这些缺陷，文中提出特征选择与Condensing技术相结合的取样方法，以达到下述目的.在减少kNN分类的计算量及存储量的同时，保证分类器的性能.首先由传统的特征选择方法产生训练集里每类训练数据的特征.再根据文档自身的类特征，结合Condensing策略移去多余的训练实例.大量实验表明，用该方法所取得的样本作为训练集，不仅极大减少kNN方法的时空开销，而且降低噪声，提高分类器性能.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	郝秀兰
	陶晓鹏
	王述云
	徐和祥
	胡运发

关键词 ：文本分类, k-近邻(kNN), 取样, 特征选择, Condensing算法

Abstract：As an instance based classifier, kNN has many computational and store requirements. Meanwhile, the poor performance of kNN classifier is caused by the imbalance distribution of training data. Aiming at these defects of kNN classifier, a technique, combining feature selection and condensing, is proposed to reduce the time cost and the space of kNN classifier. The proposed algorithm is divided into two steps. Firstly, several traditional methods of feature selection are combined to form features for each class. Then, redundant cases are removed by combination of class features contained in samples with Condensing algorithm. Experimental results indicate when the sample set acquired by the proposed method is used as training set, the classifier saves the time cost and the space dramatically, and the performance of the kNN classifier is improved because noisy data are removed from the training set.

Key words： Text Categorization k-Nearest Neighbor (kNN) Sampling Feature Selection Condensing Algorithm

收稿日期: 2007-12-18

ZTFLH:

TP391

基金资助:国家自然科学基金资助项目(No. 60736016)

作者简介: 郝秀兰，女，1970年生，博士，主要研究方向为全文数据库、自然语言处理.E-mail: xiulanhao@fudan.edu.cn.陶晓鹏，男，1970年生，副教授，博士后，主要研究方向为文本管理、自然语言处理.王述云，女，1976年生，博士，主要研究方向为数据挖掘.徐和祥，男，1972年生，博士，主要研究方向为Deep Web、数据挖掘.胡运发，男，1940年生，教授，博士生导师，主要研究方向为数据工程、知识工程.

引用本文:

郝秀兰，陶晓鹏，王述云，徐和祥，胡运发. 基于特征选择及Condensing技术的文本取样^*[J]. 模式识别与人工智能, 2009, 22(5): 709-717. HAO Xiu-Lan, TAO Xiao-Peng, WANG Shu-Yun, XU He-Xiang, HU Yun-Fa. Documents Sampling Based on Feature Selection and Condensing Techniques. , 2009, 22(5): 709-717.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/ 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2009/V22/I5/709

[1] Su Jinshu, Zhang Bofeng, Xu Xin. Advances in Machine Learning Based Text Categorization. Journal of Software, 2006, 17(9): 1848-1859 (in Chinese)
(苏金树,张博锋,徐昕.基于机器学习的文本分类技术研究进展.软件学报, 2006, 17(9): 1848-1859)
[2] Li Ronglu, Hu Yunfa. Noise Reduction to Text Categorization Based on Density for kNN // Proc of the International Conference on Machine Learning and Cybernetics. Xi'an, China, 2003, Ⅴ: 3119-3124
[3] Guan Jihong, Zhou Shuigeng. Pruning Training Corpus to Speedup Text Classification // Proc of the 13th International Conference on Database and Expert Systems Applications. Zurich, Switzerland, 2002: 831-840
[4] Kuncheva L I, Jain L C. Nearest Neighbor Classifier: Simultaneous Editing and Feature Selection. Pattern Recognition Letters, 1999, 20(11/12/13): 1149-1156
[5] Hart P E. The Condensed Nearest Neighbor Rule. IEEE Trans on Information Theory, 1968, 14(3): 515-516
[6] Wilson D L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans on Systems, Man and Cybernetics, 1972, 2(3): 408-421
[7] Devijver P, Kittler J. Pattern Recognition: A Statistical Approach. Englewood Cliffs, USA: Prentice Hall, 1982
[8] Kuncheva L I. Fitness Functions in Editing k-NN Reference Set by Genetic Algorithms. Pattern Recognition, 1997, 30(6): 1041-1049
[9] Li Yuangui, Huang Jinjie, Zhang Weidong, et al. New Prototype Selection Rule Integrated Condensing with Editing Process for the Nearest Neighbor Rules // Proc of the IEEE International Conference on Industrial Technology. Hongkong, China, 2005: 950-954
[10] Wilson D R, Martinez T R. Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning, 2000, 38(3): 257-286
[11] Sanchez J S, Sotoca J M, Pla E. Efficient Nearest Neighbor Classification with Data Reduction and Fast Search Algorithms // Proc of the IEEE International Conference on Systems, Man and Cybernetics. Hague, Netherlands, 2004, Ⅴ: 4757-4762
[12] Vázquez F, Sánchez J S, Pla F. A Stochastic Approach to Wilson's Editing Algorithm // Proc of the 2nd Iberian Conference on Pattern Recognition and Image Analysis. Estoril, Portugal, 2005: 35-42
[13] Ferri F J, Albert J V, Vidal E. Consideration about Sample-Size Sensitivity of a Family of Edited Nearest-Neighbor Rules. IEEE Trans on Systems, Man and Cybernetics, 1999, 29(5): 667-672
[14] Zhao Hua, Zhao Tiejun, Yu Hao, et al. English Topic Tracking Research Based on Query Vector. Journal of Computer Research and Development, 2007, 44(8): 1412-1417 (in Chinese)
(赵华,赵铁军,于浩,等.基于查询向量的英语话题跟踪研究.计算机研究与发展, 2007, 44(8): 1412-1417)
[15] Li Ronglu. The Key Techniques Research on Text Categorization. Ph.D Dissertation. Shanghai, China: Fudan University. Department of Computer and Information Technology, 2005 (in Chinese)
(李荣陆.文本分类若干关键技术研究.博士学位论文.上海:复旦大学.计算机与信息技术系, 2005)
[16] Research on Feature Extraction of Text [EB/OL]. [2008-05-23]. http://blog.csdn.net/tvetve/archive/2008/04/14/2292111.aspx (in Chinese)
(文本特征提取方法研究[EB/OL]. [2008-05-23]. http://blog.csdn.net/tvetve/archive/2008/04/14/2292111.aspx)
[17] Hao Xiulan, Zhang Chenghong, Wang Shuyun, et al. Efficient kNN Text Categorization Based on Multi-Edit and Condensing Techniques // Proc of the International Conference on Machine Learning and Cybernetics. Hongkong, China, 2007, Ⅵ: 3571 -3576
[18] Chawla N V, Japkowicz N, Kotcz A. Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletters, 2004, 6(1):1-6
[19] Weiss G M. Mining with Rarity: A Unifying Framework. ACM SIGKDD Explorations Newsletters, 2004, 6(1): 7-19
[20] Tan Songbo, Wang Yuefen. Chinese Corpus of Text Categorization—TanCorp V1.0 [DB/OL]. [2008-05-23]. http://www.searchforum.org.cn/tansongbo/corpus.htm (in Chinese)
(谭松波,王月粉.中文文本分类语料库——TanCorpV1.0 [DB/OL]. [2008-05-23]. http://www.searchforum.org.cn/tansongbo/corpus.htm)
[21] Tan Songbo, Cheng Xueqi, Ghanem M M, et al. A Novel Refinement Approach for Text Categorization // Proc of the 14th ACM International Conference on Information and Knowledge Management. Bremen, Germany, 2005: 469-476
[22] Jo T, Japkowicz N. Class Imbalances versus Small Disjuncts. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 40-49