Text Feature Selection Method for Hierarchical Classification
ZHU Cui-Ling1,2, MA Jun1, ZHANG Dong-Mei1
1. School of Computer Science and Technology, Shandong University, Jinan 250101; 2. School of Information Management, Shandong Economic University, Jinan 250014
Abstract: A feature selection approach for hierarchical text classification is proposed. First, the concept of category hierarchical correlation degree is introduced; it is computed from the category tree and the probability distribution of the training data at different levels of the tree. Next, the importance degree of each category is computed from the hierarchical correlation degrees. Finally, the discriminative ability of each feature is calculated from these quantities, and the features with the greatest discriminative ability are selected as the feature set for classification. Experimental results show that the proposed approach outperforms traditional feature selection methods both in the quality of the selected features and in standard classification metrics, namely accuracy, F1, and micro-precision.
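The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the formulas, so every definition below is an assumed stand-in rather than the authors' method: the toy category tree `PARENT`, the toy corpus `DOCS`, the correlation `1 / (1 + tree distance)`, the importance weight, and the scoring rule are all hypothetical and serve only to show how the stages could feed into one another.

```python
from collections import Counter

# Hypothetical toy category tree: child -> parent (None marks the root).
PARENT = {"root": None, "sports": "root", "science": "root",
          "football": "sports", "tennis": "sports", "physics": "science"}
LEAVES = ["football", "tennis", "physics"]

# Hypothetical toy training data: (leaf category, tokens).
DOCS = [
    ("football", ["goal", "match", "team"]),
    ("football", ["goal", "team", "league"]),
    ("tennis", ["match", "serve", "racket"]),
    ("tennis", ["serve", "match", "set"]),
    ("physics", ["energy", "quantum", "set"]),
    ("physics", ["energy", "particle", "team"]),
]

def path_to_root(cat):
    """Return the list of categories from cat up to the root."""
    path = []
    while cat is not None:
        path.append(cat)
        cat = PARENT[cat]
    return path

def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    da = next(i for i, c in enumerate(pa) if c in common)
    db = next(i for i, c in enumerate(pb) if c in common)
    return da + db

def correlation(a, b):
    """Assumed stand-in for the hierarchical correlation degree."""
    return 1.0 / (1.0 + tree_distance(a, b))

def importance(cat, prior):
    """Assumed stand-in: weight a category by its prior and by how
    closely it is related to the other leaf categories."""
    others = [c for c in LEAVES if c != cat]
    mean_corr = sum(correlation(cat, c) for c in others) / len(others)
    return prior[cat] * (1.0 + mean_corr)

def select_features(docs, k):
    """Score each term by an importance-weighted deviation of its
    per-category document frequency from its overall frequency,
    then return the k highest-scoring terms and all scores."""
    n_docs = Counter(cat for cat, _ in docs)
    prior = {c: n_docs[c] / len(docs) for c in LEAVES}
    df_total = Counter()
    df_cat = {c: Counter() for c in LEAVES}
    for cat, words in docs:
        for w in set(words):
            df_total[w] += 1
            df_cat[cat][w] += 1
    scores = {}
    for w in df_total:
        p_w = df_total[w] / len(docs)
        scores[w] = sum(
            importance(c, prior) * abs(df_cat[c][w] / n_docs[c] - p_w)
            for c in LEAVES
        )
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k], scores

top_terms, all_scores = select_features(DOCS, k=3)
print(top_terms)
```

In this sketch, terms concentrated in a single category (such as "goal") score higher than terms spread across categories (such as "team"); the actual measures used in the paper may weight categories and features quite differently.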