Abstract:K Nearest Neighbor (KNN) is a simple, stable and effective supervised classification algorithm in machine learning and is used in many practical applications. Its complexity increases with the number of instances, and thus it is not practicable for large-scale or high dimensional data. In this paper, an improved KNN algorithm based on variable parameter rough set model (RSKNN) is proposed. By introducing the concept of upper and lower approximations in variable precision rough set model, the instances of each class are classified into core and boundary areas, and the distribution of the training set is obtained. For a new instance, RSKNN firstly computes the area it belongs to. Then, according to the area information, the algorithm determines the category directly or searches k-nearest neighbors among the related areas instead of all areas. In this way, the computing cost is reduced and the robustness is enhanced. The experimental results for selected UCI datasets show that the proposed method is more effective than the traditional KNN with high classification accuracy.
[1] Witten L H,Frank E.Data Mining: Practical Machine Learning Tools and Techniques.2nd Edition.New York ,USA: Elsevier,2005 [2] Wu Xindong,Kumar V,Quinlan J R,et al.Top 10 Algorithms in Data Mining.Knowledge and Information Systems,2008,14(1): 1-37 [3] Hu Yan,Wu Huzi,Zhong Luo.Research of Chinese Web Classification Method Based on Improved KNN Algorithm.Journal of Wuhan University: Engineering,2007,40(4): 141-144 (in Chinese) (胡 燕,吴虎子,钟 珞.基于改进的KNN算法的中文网页自动分类方法研究.武汉大学学报:工学版,2007,40(4): 141-144) [4] Li Ronglu,Hu Yunfa.A Density-Based Method for Reducing the Amount of Training Data in KNN Text Classification.Journal of Computer Research and Development,2004,41(4): 539-545 (in Chinese) (李荣陆,胡运发.基于密度的KNN文本分类器训练样本裁剪方法.计算机研究与发展,2004,41(4): 539-545) [5] Zhang Xiaofei,Huang Heyan.An Improved KNN Text Categorization Algorithm by Adopting Cluster Technology.Pattern Recognition and Artificial Intelligence,2009,22(6): 936-940 (in Chinese) (张孝飞,黄河燕.一种采用聚类技术改进的KNN文本分类方法.模式识别与人工智能,2009,22(6): 936-940) [6] Wang Yu,Bai Shi,Wang Zhengou.A Fast KNN Algorithm Applied to Web Text Categorization.Journal of the China Society for Scientific and Technical Information,2007,26(1): 60-64(in Chinese) (王 煜,白 石,王正欧.用于Web文本分类的快速KNN算法.情报学报,2007,26(1): 60-64) [7] Ziarko W.Variable Precision Rough Sets Model.Journal of Computer and System Science,1993,46(1): 39-59 [8] Zhang Nan,Miao Duoqian,Yue Xiaodong. Approaches to Knowledge Reduction in Interval-Valued Information System,2010,47(8): 1362-1371 (in Chinese) (张 楠,苗夺谦,岳晓冬.区间值信息系统的知识约简.计算机研究与发展,2010,47(8): 1362-1371) [9] Pawlak Z.Rough Sets.International Journal of Computer and Information Science,1982,11(3): 341- 356. [10] Katzberg J D,Ziarko W.Variable Precision Rough Sets with Asymmetric Bounds // Proc of the International Workshop on Rough Sets and Knowledge Discovery: Rough Sets,Fuzzy Sets and Knowledge Discovery.Banff,Canada,1993: 167-177 [11] Dai Liuling,Huang Heyan,Chen Zhaoxiong.A Comparative Study on Feature Selection in Chinese Text Categorization.Journal of Chinese Information Processing,2004,18(1): 26-32 (in Chinese) (代六玲,黄河燕,陈肇雄.中文文本分类中特征抽取方法的比较研究.中文信息学报,2004,18(1): 26-32) [12] Lingras P,Chad W.Interval Set Clustering of Web Users with Rough K-Means.Journal of Intelligent Information Systems,2004,23(1): 5-16 [13] Falcón R,Jeon G,Lee K,et al.Mechanisms of Partial Supervision in Rough Clustering Approaches // Proc of the 4th International Conference on Rough Sets and Knowledge Technology.Gold Coast,Australia,2009: 33-45 [14] Pawan L.Unsupervised Rough Set Classification Using Gas.Journal of Intelligent Information Systems,2001,16(3): 215-228