Abstract: The k-Nearest Neighbor (KNN) algorithm offers high accuracy and stability, but its time complexity is directly proportional to the size of the training set, so its classification speed is low and it is difficult to apply in large-scale information processing. An improved KNN text categorization algorithm is proposed that classifies faster than traditional KNN. First, similar sample documents are merged into center documents by automatic text clustering; the large set of original samples is then replaced with the small set of cluster centers. The amount of computation in KNN is thereby greatly reduced and classification is sped up. Experimental results show that the time cost of the proposed algorithm is reduced by an order of magnitude, while its accuracy is approximately equal to that of SVM and traditional KNN.
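The idea in the abstract can be sketched in a few lines: cluster the training documents of each class, keep only the cluster centers as the reduced training set, and run KNN over those centers. The sketch below is a minimal illustration under assumed simplifications (document vectors as plain NumPy arrays, a basic k-means refinement, Euclidean distance); it is not the paper's exact clustering procedure, and all function names are illustrative.

```python
import numpy as np

def class_centers(X, y, centers_per_class=2, iters=10, seed=0):
    """Cluster each class's document vectors and keep only the
    cluster centers as the reduced KNN training set (a sketch of
    the paper's idea, not its exact clustering method)."""
    rng = np.random.default_rng(seed)
    centers_all, labels = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k = min(centers_per_class, len(Xc))
        # initialize centers from k random documents of this class
        centers = Xc[rng.choice(len(Xc), k, replace=False)]
        for _ in range(iters):  # plain k-means refinement
            d = ((Xc[:, None] - centers[None]) ** 2).sum(-1)
            assign = d.argmin(1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = Xc[assign == j].mean(0)
        centers_all.append(centers)
        labels += [cls] * k
    return np.vstack(centers_all), np.array(labels)

def knn_predict(x, centers, labels, k=3):
    """Classify x by majority vote among its k nearest centers.
    KNN now searches the few centers, not all original samples."""
    d = ((centers - x) ** 2).sum(1)
    nearest = labels[d.argsort()[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]
```

Because the search runs over the cluster centers instead of the full sample set, the per-query cost drops from O(n) distance computations to O(c), where c is the (much smaller) number of centers.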
[1] Lewis D D. Naive Bayes at Forty: The Independence Assumption in Information Retrieval // Proc of the 10th European Conference on Machine Learning. Chemnitz, Germany, 1998: 4-15
[2] Cohen W W, Singer Y. Context-Sensitive Learning Methods for Text Categorization // Proc of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Zurich, Switzerland, 1996: 307-315
[3] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features // Proc of the 10th European Conference on Machine Learning. Chemnitz, Germany, 1998: 137-142
[4] Nigam K, Lafferty J, McCallum A. Using Maximum Entropy for Text Classification // Proc of the Workshop on Machine Learning for Information Filtering. Stockholm, Sweden, 1999: 61-67
[5] Yang Yiming, Liu Xin. A Re-Examination of Text Categorization Methods // Proc of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley, USA, 1999: 42-49
[6] Sebastiani F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 2002, 34(1): 1-47
[7] Li Ronglu, Hu Yunfa. A Density-Based Method for Reducing the Amount of Training Data in KNN Text Classification. Journal of Computer Research and Development, 2004, 41(4): 539-545 (in Chinese) (李荣陆, 胡运发. 基于密度的KNN文本分类器训练样本裁剪方法. 计算机研究与发展, 2004, 41(4): 539-545)
[8] Hu Yan, Wu Huzi, Zhong Luo. Research of Chinese Web Classification Method Based on Improved KNN Algorithm. Engineering Journal of Wuhan University, 2007, 40(4): 141-144 (in Chinese) (胡燕, 吴虎子, 钟珞. 基于改进的KNN算法的中文网页自动分类方法研究. 武汉大学学报: 工学版, 2007, 40(4): 141-144)
[9] Wang Yu, Bai Shi, Wang Zheng'ou. A Fast KNN Algorithm Applied to Web Text Categorization. Journal of the China Society for Scientific and Technical Information, 2007, 26(1): 60-64 (in Chinese) (王煜, 白石, 王正欧. 用于Web文本分类的快速KNN算法. 情报学报, 2007, 26(1): 60-64)
[10] Dai Liuling, Huang Heyan, Chen Zhaoxiong. A Comparative Study on Feature Selection in Chinese Text Categorization. Journal of Chinese Information Processing, 2004, 18(1): 26-32 (in Chinese) (代六玲, 黄河燕, 陈肇雄. 中文文本分类中特征抽取方法的比较研究. 中文信息学报, 2004, 18(1): 26-32)
[11] Hull D A. Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing // Proc of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 1994: 282-289
[12] Joachims T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization // Proc of the 14th International Conference on Machine Learning. Nashville, USA, 1997: 143-151
[13] Galavotti L, Sebastiani F, Simi M. Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization // Proc of the 4th European Conference on Research and Advanced Technology for Digital Libraries. Lisbon, Portugal, 2000: 59-68