Document Feature Selection Based on the Minimum Term Frequency Threshold
CHEN XiaoYun1,2, LI RongLu1, HU YunFa1
1. Department of Computer and Information Technology, Fudan University, Shanghai 200433; 2. School of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002
Abstract: This paper presents a novel feature evaluation function based on document frequency with a minimum term frequency threshold (DFn). To reduce the influence of unrelated features on text categorization, the properties of such features are analyzed, showing that their term frequency is typically low. Filtering out low-frequency features with a minimum term frequency threshold therefore removes many of the unrelated features. Experimental results show that the proposed method greatly reduces the number of unrelated features and improves the accuracy of text categorization. The improvement to Mutual Information (MI) is particularly marked; the macro-averaged F1 value based on DFn is 40% higher than that of Term Frequency and 15% to 30% higher than that of Document Frequency (DF).
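The abstract does not give the formal definition of DFn, so the following is only a minimal sketch under one plausible reading: a term's DFn score is the number of documents in which the term occurs at least n times. The function names `df_n` and `select_features`, the parameter `k`, and the representation of documents as token lists are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def df_n(docs, n=2):
    """Assumed reading of DFn: for each term, count the documents
    in which the term occurs at least n times."""
    scores = Counter()
    for tokens in docs:
        term_freq = Counter(tokens)  # term frequency within this document
        for term, count in term_freq.items():
            if count >= n:  # minimum term frequency threshold
                scores[term] += 1
    return scores


def select_features(docs, n=2, k=1000):
    """Keep the k terms with the highest assumed DFn score (hypothetical helper)."""
    return [term for term, _ in df_n(docs, n).most_common(k)]


# Toy corpus: each document is a list of tokens.
docs = [
    ["a", "a", "b"],
    ["a", "b", "b"],
    ["a", "a", "c"],
]
dfn_scores = df_n(docs, n=2)
plain_df = df_n(docs, n=1)  # with n=1 this reduces to ordinary document frequency
```

With n set to 1 the score is ordinary document frequency, so under this reading the threshold is the only change: a term that appears just once in a document no longer counts toward that document.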
陈晓云,李荣陆,胡运发. 基于最小词频阈值的文档特征选择*[J]. 模式识别与人工智能, 2006, 19(4): 531-537.
CHEN XiaoYun, LI RongLu, HU YunFa. Document Feature Selection Based on the Minimum Term Frequency Threshold. Pattern Recognition and Artificial Intelligence, 2006, 19(4): 531-537.