Abstract: The k-Nearest Neighbor (KNN) algorithm offers high accuracy and stability, but its time complexity is directly proportional to the size of the training set, so its classification speed is low and it is difficult to apply in large-scale information processing. An improved KNN text categorization algorithm is proposed that classifies faster than traditional KNN. First, similar sample documents are merged into center documents by automatic text clustering; the large set of original samples is then replaced with the small set of cluster centers. The amount of computation in KNN is thereby greatly reduced and classification is sped up. Experimental results show that the time cost of the proposed algorithm is reduced by an order of magnitude, while its accuracy is approximately equal to that of SVM and traditional KNN.
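The idea in the abstract can be sketched in a few lines: cluster the training documents of each class, keep only the cluster centers as the reduced training set, and run KNN over those centers. The sketch below is a minimal illustration under assumed simplifications (document vectors as plain NumPy arrays, a basic k-means refinement, Euclidean distance); it is not the paper's exact clustering procedure, and all function names are illustrative.

```python
import numpy as np

def class_centers(X, y, centers_per_class=2, iters=10, seed=0):
    """Cluster each class's document vectors and keep only the
    cluster centers as the reduced KNN training set (a sketch of
    the paper's idea, not its exact clustering method)."""
    rng = np.random.default_rng(seed)
    centers_all, labels = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k = min(centers_per_class, len(Xc))
        # initialize centers from k random documents of this class
        centers = Xc[rng.choice(len(Xc), k, replace=False)]
        for _ in range(iters):  # plain k-means refinement
            d = ((Xc[:, None] - centers[None]) ** 2).sum(-1)
            assign = d.argmin(1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = Xc[assign == j].mean(0)
        centers_all.append(centers)
        labels += [cls] * k
    return np.vstack(centers_all), np.array(labels)

def knn_predict(x, centers, labels, k=3):
    """Classify x by majority vote among its k nearest centers.
    KNN now searches the few centers, not all original samples."""
    d = ((centers - x) ** 2).sum(1)
    nearest = labels[d.argsort()[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]
```

Because the search runs over the cluster centers instead of the full sample set, the per-query cost drops from O(n) distance computations to O(c), where c is the (much smaller) number of centers.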
[1] Lewis D D. Naive Bayes at Forty: The Independence Assumption in Information Retrieval // Proc of the 10th European Conference on Machine Learning. Chemnitz, Germany, 1998: 4-15
[2] Cohen W W, Singer Y. Context-Sensitive Learning Methods for Text Categorization // Proc of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Zurich, Switzerland, 1996: 307-315
[3] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features // Proc of the 10th European Conference on Machine Learning. Chemnitz, Germany, 1998: 137-142
[4] Nigam K, Lafferty J, McCallum A. Using Maximum Entropy for Text Classification // Proc of the Workshop on Machine Learning for Information Filtering. Stockholm, Sweden, 1999: 61-67
[5] Yang Yiming, Liu Xin. A Re-Examination of Text Categorization Methods // Proc of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berkeley, USA, 1999: 42-49
[6] Sebastiani F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 2002, 34(1): 1-47
[7] Li Ronglu, Hu Yunfa. A Density-Based Method for Reducing the Amount of Training Data in KNN Text Classification. Journal of Computer Research and Development, 2004, 41(4): 539-545 (in Chinese) (李荣陆, 胡运发. 基于密度的KNN文本分类器训练样本裁剪方法. 计算机研究与发展, 2004, 41(4): 539-545)
[8] Hu Yan, Wu Huzi, Zhong Luo. Research of Chinese Web Classification Method Based on Improved KNN Algorithm. Engineering Journal of Wuhan University, 2007, 40(4): 141-144 (in Chinese) (胡燕, 吴虎子, 钟珞. 基于改进的KNN算法的中文网页自动分类方法研究. 武汉大学学报: 工学版, 2007, 40(4): 141-144)
[9] Wang Yu, Bai Shi, Wang Zheng'ou. A Fast KNN Algorithm Applied to Web Text Categorization. Journal of the China Society for Scientific and Technical Information, 2007, 26(1): 60-64 (in Chinese) (王煜, 白石, 王正欧. 用于Web文本分类的快速KNN算法. 情报学报, 2007, 26(1): 60-64)
[10] Dai Liuling, Huang Heyan, Chen Zhaoxiong. A Comparative Study on Feature Selection in Chinese Text Categorization. Journal of Chinese Information Processing, 2004, 18(1): 26-32 (in Chinese) (代六玲, 黄河燕, 陈肇雄. 中文文本分类中特征抽取方法的比较研究. 中文信息学报, 2004, 18(1): 26-32)
[11] Hull D A. Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing // Proc of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 1994: 282-289
[12] Joachims T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization // Proc of the 14th International Conference on Machine Learning. Nashville, USA, 1997: 143-151
[13] Galavotti L, Sebastiani F, Simi M. Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization // Proc of the 4th European Conference on Research and Advanced Technology for Digital Libraries. Lisbon, Portugal, 2000: 59-68