基于属性树的并行化增量式动态属性约简算法

doi:10.16451/j.cnki.issn1003-6059.202210007

摘要
图/表
参考文献
相关文章 (15)

全文: PDF (1104 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要传统增量算法主要侧重于从更新近似的角度进行属性约简,但在处理大规模数据集时需要评估所有属性并反复计算重要度,提升时间复杂度,降低效率.针对上述问题,文中提出基于属性树的并行化增量加速策略,关键步骤是将所有属性聚类成多棵属性树进行并行化动态属性评估.首先,根据属性树相关性度量选择合适的属性树进行属性评估,降低时间复杂度.再增加分支系数至停止准则中,随着分支深度的增加进行动态增加,可使算法在达到最大阈值后自主跳出循环,有效避免原先的冗余计算,提高效率.基于上述策略,提出基于属性树的增量属性约简算法,同时与Spark并行机制结合,设计基于属性树的并行化增量式动态属性约简算法.在多个数据集上的广泛实验表明,文中算法在保持分类性能的同时,可提高动态变化数据集约简的搜索效率,具有较好的性能优势.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	秦廷桢
	丁卫平
	鞠恒荣
	李铭
	黄嘉爽
	陈悦鹏
	王海鹏

关键词 ：属性约简, 属性树, 知识粒度, 并行计算, 增量学习, Spark框架

Abstract：Traditional incremental methods mainly focus on the attribute reduction from the perspective of updating approximation. However, while processing large-scale data sets, the methods need to evaluate all attributes and calculate importance repeatedly. Thus, time complexity is increased and efficiency is decreased. To solve the problems, an incremental acceleration strategy for parallelization based on attribute tree is proposed. The key step is to cluster all attributes into multiple attribute trees for parallel dynamic attribute evaluation. Firstly, an appropriate attribute tree is selected for attribute evaluation according to the attribute tree correlation measure to reduce the time complexity. Then, the branch coefficient is added to the stop criterion, and the dynamic increase is conducted with the increase of the branch depth. Consequently, the algorithm can jump out of the cycle automatically after reaching the maximum threshold to avoid the original redundant calculation and improve the efficiency effectively. Based on the above improvements, an incremental dynamic attribute reduction algorithm based on attribute tree is proposed, and a parallel incremental dynamic attribute reduction algorithm based on attribute tree is designed by being combined with Spark parallel mechanism. Finally, experiments on multiple datasets show that the proposed algorithm improves the search efficiency of dynamically variational dataset reduction significantly while maintaining the classification performance, holding a better performance advantage.

Key words： Attribute Reduction Attribute Tree Knowledge Granularity Parallel Computing Incremental Learning Spark Framework

收稿日期: 2022-09-19

ZTFLH:

TP 18

基金资助:国家自然科学基金项目(No.61976120,62006128,62102199)、江苏省自然科学基金项目(No.BK20191445)、江苏省双创博士计划(No.(2020)30986)、江苏省高等学校自然科学研究重大项目(No.21KJA510004)、江苏省高等学校自然科学研究面上项目(No.20KJB520009)、南通市科技局基础科学研究项目(No.JC2020141,JC2021122)、中国博士后科学基金项目(No.2022M711716)、教育部人文社会科学研究青年基金项目(No.21YJCZH013)、江苏省研究生科研与实践创新计划(No.SJCX21_1446)资助

通讯作者: 丁卫平,博士,教授,主要研究方向为数据挖掘、机器学习、粒计算、演化计算、大数据分析等.E-mail:dwp9988@163.com.

作者简介: 秦廷桢,硕士研究生,主要研究方向为数据挖掘、粒计算、粗糙集.E-mail:qtz517469152@163.com. 鞠恒荣,博士,副教授,主要研究方向为粒计算、粗糙集、机器学习、知识发现.E-mail:juhengrong@ntu.edu.cn. 李铭,硕士,主要研究方向为数据挖掘、粒度计算、大数据分析.E-mail:1930310041@stmail.ntu.edu.cn. 黄嘉爽,博士,副教授,主要研究方向为脑网络分析、深度学习.E-mail:hjshdym@163.com. 陈悦鹏,博士研究生,主要研究方向为粒计算、粗糙集、机器学习.E-mail:15850623872@163.com. 王海鹏,硕士研究生,主要研究方向为模糊理论、粒度计算、深度学习.E-mail:1781053215@qq.com.

引用本文:

秦廷桢, 丁卫平, 鞠恒荣, 李铭, 黄嘉爽, 陈悦鹏, 王海鹏. 基于属性树的并行化增量式动态属性约简算法[J]. 模式识别与人工智能, 2022, 35(10): 939-951. QIN Tingzhen, DING Weiping, JU Hengrong, LI Ming, HUANG Jiashuang, CHEN Yuepeng, WANG Haipeng. Parallel Incremental Dynamic Attribute Reduction Algorithm Based on Attribute Tree. Pattern Recognition and Artificial Intelligence, 2022, 35(10): 939-951.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202210007 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2022/V35/I10/939

[1] CHEN M, MAO S W, LIU Y H.Big Data: A Survey. Mobile Network and Applications, 2014, 19(2): 171-209.
[2] TANG J J, LIANG J, ZHANG S, et al. Inferring Driving Trajectories Based on Probabilistic Model from Large Scale Taxi GPS Data. Physica A(Statistical Mechanics and Its Applications), 2018, 506: 566-577.
[3] HAN L X, LIEW C S, VAN HEMERT J, et al. A Generic Parallel Processing Model for Facilitating Data Mining and Integration. Para-llel Computing, 2011, 37(3): 157-171.
[4] SRINIVASAN A, FARUQUIE T A, JOSHI S.Data and Task Para-llelism in ILP Using MapReduce. Machine Learning, 2012, 86(1): 141-168.
[5] PAWLAK Z. RoughSets. International Journal of Computer and Information Science, 1982, 11(5): 341-356.
[6] HASSANIEN A E, ABRAHAM A, PETERS J F, et al. Rough Sets and Near Sets in Medical Imaging: A Review. IEEE Transactions on Information Technology in Biomedicine, 2009, 13(6): 955-968.
[7] HERAWAN T, DERIS M M, ABAWAJY J H.A Rough Set Approach for Selecting Clustering Attribute. Knowledge-Based Systems, 2010, 23(3): 220-231.
[8] SWINIARSKI R W, SKOWRON A.Rough Set Methods in Feature Selection and Recognition. Pattern Recognition Letters, 2003, 24(6): 833-849.
[9] LINGRAS P, CHEN M, MIAO D Q.Rough Cluster Quality Index Based on Decision Theory. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(7): 1014-1026.
[10] RIZA L S, JANUSZ A, BERGMEIR C, et al. Implementing Algorithms of Rough Set Theory and Fuzzy Rough Set Theory in the R Package "RoughSets". Information Sciences, 2014, 287: 68-89.
[11] CHEN Y M, MIAO D Q, WANG R Z, et al. A Rough Set Approach to Feature Selection Based on Power Set Tree. Know-ledge-Based Systems, 2011, 24(2): 275-281.
[12] ANANTHANARAYANA V S, MURTY M N, SUBRAMANIAN D K.Tree Structure for Efficient Data Mining Using Rough Sets. Pattern Recognition Letters, 2003, 24(6): 851-862.
[13] YAO Y Y, ZHAO Y.Attribute Reduction in Decision-Theoretic Rough Set Models. Information Sciences, 2008, 178(17): 3356-3373.
[14] YAO Y Y.Three-Way Decisions with Probabilistic Rough Sets. Information Sciences, 2010, 180(3): 341-353.
[15] 徐章艳,刘作鹏,杨炳儒,等.一个复杂度为max(O(|C||U|),O(|C|²|U/C|))的快速属性约简算法.计算机学报, 2006, 29(3): 391-399.
(XU Z Y, LIU Z P, YANG B R, et al. A Quick Attribute Reduction Algorithm with Complexity of max(O(|C||U|),O(|C|²·|U/C|)). Chinese Journal of Computers, 2006, 29(3): 391-399.)
[16] QIAN Y H, LIANG J Y, PEDRYCZ W, et al. Positive Approximation: An Accelerator for Attribute Reduction in Rough Set Theory. Artificial Intelligence, 2010, 174(9/10): 597-618.
[17] 苗夺谦,胡桂荣.知识约简的一种启发式算法.计算机研究与发展, 1999, 36(6): 681-684.
(MIAO D Q, HU G R.A Heuristic Algorithm for Reduction of Knowledge. Journal of Computer Research and Development, 1999, 36(6): 681-684.)
[18] 王国胤,于洪,杨大春.基于条件信息熵的决策表约简.计算机学报, 2002, 25(7): 759-766.
(WANG G Y, YU H, YANG D C.Decision Table Reduction Based on Conditional Information Entropy. Chinese Journal of Computers, 2002, 25(7): 759-766.)
[19] SKOWRON A, RAUSZER C.The Discernibility Matrices and Fun-ctions in Information System// SLOWINSKI R, ed. Intelligent De-cision Support. Berlin, Germany: Springer, 1992: 331-362.
[20] QIAN J, MIAO D Q, ZHANG Z H, et al. Hybrid Approaches to Attribute Reduction Based on Indiscernibility and Discernibility Relation. International Journal of Approximate Reasoning, 2011, 52(2): 212-230.
[21] LIANG J Y, WANG F, DANG C Y, et al. An Efficient Rough Feature Selection Algorithm with a Multi-granulation View. International Journal of Approximate Reasoning, 2012, 53(6): 912-926.
[22] LI M J, YU X M, RYU K H.MapReduce-Based Web Mining for Prediction of Web-User Navigation. Journal of Information Science, 2014, 40(5): 557-567.
[23] ZHANG J B, LI T R, PAN Y.Parallel Large-Scale Attribute Reduction on Cloud Systems[C/OL]. [2022-08-20].https://arxiv.org/pdf/1610.01807.pdf.
[24] RAZA M S, QAMAR U.A Parallel Rough Set Based Dependency Calculation Method for Efficient Feature Selection. Applied Soft Computing, 2018, 71: 1020-1034.
[25] QIAN J, MIAO D Q, ZHANG Z H, et al. Parallel Attribute Reduction Algorithms Using MapReduce. Information Sciences, 2014, 279: 671-690.
[26] QIAN J, XIA M, YUE X D.Parallel Knowledge Acquisition Algorithms for Big Data Using MapReduce. International Journal of Machine Learning and Cybernetics, 2018, 9(6): 1007-1021.
[27] 王磊,李天瑞.一种基于矩阵的知识粒度计算方法.模式识别与人工智能, 2013, 26(5): 447-453.
(WANG L, LI T R.A Matrix-Based Approach for Calculation of Knowledge Granulation. Pattern Recognition and Artificial Intelligence, 2013, 26(5): 447-453.)
[28] 徐久成,史进玲,孙林.一种基于相对粒度的决策表约简算法. 计算机科学, 2009, 36(3): 205-207.
(XU J C, SHI J L, SUN L.Attribute Reduction Algorithm Based on Relative Granularity in Decision Tables. Computer Science, 2009, 36(3): 205-207.)
[29] SHU W H, SHEN H.Incremental Feature Selection Based on Rough Set in Dynamic Incomplete Data. Pattern Recognition, 2014, 47(12): 3890-3906.
[30] LIU D, LI T R, ZHANG J B.A Rough Set-Based Incremental Approach for Learning Knowledge in Dynamic Incomplete Information Systems. International Journal of Approximate Reasoning, 2014, 55(8): 1764-1786.
[31] JING Y G, LI T R, LUO C, et al. An Incremental Approach for Attribute Reduction Based on Knowledge Granularity. Knowledge-Based Systems, 2016, 104: 24-38.
[32] 钱进,苗夺谦,张泽华.云计算环境下知识约简算法.计算机学报, 2011, 34(12): 2332-2343.
(QIAN J, MIAO D Q, ZHANG Z H.Knowledge Reduction Algorithms in Cloud Computing. Chinese Journal of Computers, 2011, 34(12): 2332-2343.)
[33] LIANG J Y, WANG F, DANG C Y, et al. A Group Incremental Approach to Feature Selection Applying Rough Set Technique. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2): 294-308.
[34] CHEN Y, LIU K Y, SONG J J, et al. Attribute Group for Attri-bute Reduction. Information Sciences, 2020, 535: 64-80.
[35] PAWLAK Z, SKOWRON A.Rudiments of Rough Sets. Information Sciences, 2007, 177(1): 3-27.
[36] DEAN J, GHEMAWAT S.MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008, 51(1): 107-113.
[37] 吴信东,嵇圣硙.MapReduce与Spark用于大数据分析之比较.软件学报, 2018, 29(6): 1770-1791.
(WU X D, JI S W.Comparative Study on MapReduce and Spark for Big Data Analytics. Journal of Software, 2018, 29(6): 1770-1791.)
[38] YIN L Z, QIN L Y, JIANG Z H, et al. A Fast Parallel Attribute Reduction Algorithm Using Apache Spark. Knowledge-Based Systems, 2021, 212. DOI: 10.1016/j.knosys.2020.106582.
[39] KARUN A K, CHITHARANJAN K.A Review on Hadoop-HDFS Infrastructure Extensions// Proc of the IEEE Conference on Information and Communication Technologies. Washington, USA: IEEE, 2013: 132-137.
[40] KODINARIYA T M, MAKWANA P R.Review on Determining Number of Cluster in K-means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 2013, 1(6): 90-95.
[41] PAWLAK Z.Rough Sets: Theoretical Aspects of Reasoning about Data. Berlin, Germany: Springer, 1991.
[42] SHU W H, QIAN W B.An Incremental Approach to Attribute Reduction from Dynamic Incomplete Decision Systems in Rough Set Theory. Data and Knowledge Engineering, 2015, 100: 116-132.
[43] WEI W, SONG P, LIANG J Y, et al. Accelerating Incremental Attribute Reduction Algorithm by Compacting a Decision Table. International Journal of Machine Learning and Cybernetics, 2019, 10(9): 2355-2373.