A Multivariate Discretization Method for Continuous Attributes
HOU Ju-Chi¹, LIANG Ying², REN Chang-Zhi³
¹ Department of Student Affairs, Hebei University of Engineering, Hebei 056038
² Department of Social Work and Social Policy, School of Social and Behavioral Sciences, Nanjing University, Nanjing 210093
³ Department of Precision Instruments and Mechanology, Tsinghua University, Beijing 100084
Abstract: Most existing discretization methods consider only a single variable at a time and therefore cannot obtain an optimal discretization scheme. A data discretization method is proposed that takes the relationships among multiple attributes into account. A multivariate discretization criterion is derived from probabilistic model selection and the minimum description length principle (MDLP), and an efficient heuristic algorithm is proposed to search for the best discretization scheme under this criterion. Classification experiments on nine UCI datasets show that the proposed method significantly improves the learning accuracy of the Naive Bayes classifier.
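
For orientation, the sketch below shows how an MDLP-style test decides whether a candidate cut point is worth keeping. It is not the multivariate criterion proposed in this paper, which the abstract does not spell out. It is the well-known single-variable acceptance test of Fayyad and Irani, that is, the kind of univariate baseline the abstract contrasts against. The function names `entropy` and `mdlp_accepts_cut` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdlp_accepts_cut(values, labels, cut):
    """Univariate Fayyad-Irani MDLP test for one candidate cut point.

    Illustrative only: this is the classic single-attribute criterion,
    not the multivariate criterion of the paper.
    """
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    if not left or not right:
        return False  # a cut must split the data into two non-empty parts

    n = len(labels)
    ent_s, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)

    # Information gain of splitting at `cut`.
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r

    # Number of distinct classes in the whole set and in each part.
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))

    # Description-length penalty for encoding the cut.
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    threshold = (math.log2(n - 1) + delta) / n

    # Keep the cut only if the gain outweighs the coding cost.
    return gain > threshold


# Toy data: class "a" for values 1..10, class "b" for values 11..20.
values = list(range(1, 21))
separable = ["a"] * 10 + ["b"] * 10
alternating = ["a", "b"] * 10

print(mdlp_accepts_cut(values, separable, 10.5))    # informative cut
print(mdlp_accepts_cut(values, alternating, 10.5))  # uninformative cut
```

A cut is accepted only when its information gain exceeds the cost of describing it, so the cut that separates the two classes passes while the one that leaves both halves equally mixed is rejected. A multivariate criterion such as the one proposed here would additionally have to account for dependencies among attributes, which this single-attribute test ignores.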