A Multivariate Discretization Method for Continuous Attributes


Abstract: Most current discretization methods consider only a single variable at a time and therefore cannot find the optimal discretization scheme. This paper proposes a data discretization method that takes the relationships among multiple attributes into account. A multivariate discretization criterion is derived from probabilistic model selection and the minimum description length principle (MDLP), and an efficient heuristic algorithm based on this criterion is proposed to search for the best discretization scheme. Classification experiments on nine UCI datasets show that the proposed method significantly improves the learning accuracy of the Naïve Bayes classifier.

Key words: Data Mining, Multivariate Discretization, Minimum Description Length Principle (MDLP), Naïve Bayes Classifier

Received: 2010-12-06

Chinese Library Classification (CLC) number: TP181

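Funding: Supported by the National Natural Science Foundation of China (No. 71173099) and the National Natural Science Foundation of China Youth Program (No. 70903002)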

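About the authors: HOU Ju-Chi, male, born in 1976, lecturer. His main research interests include data mining and pattern recognition. LIANG Ying, female, born in 1979, associate professor. Her main research interests include public administration and data mining. E-mail: njulucy@163.com. REN Chang-Zhi, male, born in 1977, postdoctoral researcher and associate research fellow. His main research interests include intelligent control, data mining, and pattern recognition.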

Cite this article:

HOU Ju-Chi, LIANG Ying, REN Chang-Zhi. A Multivariate Discretization Method for Continuous Attributes [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2011, 24(6): 792-797.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2011/V24/I6/792
