Abstract:Traditional incremental methods mainly focus on the attribute reduction from the perspective of updating approximation. However, while processing large-scale data sets, the methods need to evaluate all attributes and calculate importance repeatedly. Thus, time complexity is increased and efficiency is decreased. To solve the problems, an incremental acceleration strategy for parallelization based on attribute tree is proposed. The key step is to cluster all attributes into multiple attribute trees for parallel dynamic attribute evaluation. Firstly, an appropriate attribute tree is selected for attribute evaluation according to the attribute tree correlation measure to reduce the time complexity. Then, the branch coefficient is added to the stop criterion, and the dynamic increase is conducted with the increase of the branch depth. Consequently, the algorithm can jump out of the cycle automatically after reaching the maximum threshold to avoid the original redundant calculation and improve the efficiency effectively. Based on the above improvements, an incremental dynamic attribute reduction algorithm based on attribute tree is proposed, and a parallel incremental dynamic attribute reduction algorithm based on attribute tree is designed by being combined with Spark parallel mechanism. Finally, experiments on multiple datasets show that the proposed algorithm improves the search efficiency of dynamically variational dataset reduction significantly while maintaining the classification performance, holding a better performance advantage.
[1] CHEN M, MAO S W, LIU Y H.Big Data: A Survey. Mobile Network and Applications, 2014, 19(2): 171-209. [2] TANG J J, LIANG J, ZHANG S, et al. Inferring Driving Trajectories Based on Probabilistic Model from Large Scale Taxi GPS Data. Physica A(Statistical Mechanics and Its Applications), 2018, 506: 566-577. [3] HAN L X, LIEW C S, VAN HEMERT J, et al. A Generic Parallel Processing Model for Facilitating Data Mining and Integration. Para-llel Computing, 2011, 37(3): 157-171. [4] SRINIVASAN A, FARUQUIE T A, JOSHI S.Data and Task Para-llelism in ILP Using MapReduce. Machine Learning, 2012, 86(1): 141-168. [5] PAWLAK Z. RoughSets. International Journal of Computer and Information Science, 1982, 11(5): 341-356. [6] HASSANIEN A E, ABRAHAM A, PETERS J F, et al. Rough Sets and Near Sets in Medical Imaging: A Review. IEEE Transactions on Information Technology in Biomedicine, 2009, 13(6): 955-968. [7] HERAWAN T, DERIS M M, ABAWAJY J H.A Rough Set Approach for Selecting Clustering Attribute. Knowledge-Based Systems, 2010, 23(3): 220-231. [8] SWINIARSKI R W, SKOWRON A.Rough Set Methods in Feature Selection and Recognition. Pattern Recognition Letters, 2003, 24(6): 833-849. [9] LINGRAS P, CHEN M, MIAO D Q.Rough Cluster Quality Index Based on Decision Theory. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(7): 1014-1026. [10] RIZA L S, JANUSZ A, BERGMEIR C, et al. Implementing Algorithms of Rough Set Theory and Fuzzy Rough Set Theory in the R Package "RoughSets". Information Sciences, 2014, 287: 68-89. [11] CHEN Y M, MIAO D Q, WANG R Z, et al. A Rough Set Approach to Feature Selection Based on Power Set Tree. Know-ledge-Based Systems, 2011, 24(2): 275-281. [12] ANANTHANARAYANA V S, MURTY M N, SUBRAMANIAN D K.Tree Structure for Efficient Data Mining Using Rough Sets. Pattern Recognition Letters, 2003, 24(6): 851-862. [13] YAO Y Y, ZHAO Y.Attribute Reduction in Decision-Theoretic Rough Set Models. Information Sciences, 2008, 178(17): 3356-3373. [14] YAO Y Y.Three-Way Decisions with Probabilistic Rough Sets. Information Sciences, 2010, 180(3): 341-353. [15] 徐章艳,刘作鹏,杨炳儒,等.一个复杂度为max(O(|C||U|),O(|C|2|U/C|))的快速属性约简算法.计算机学报, 2006, 29(3): 391-399. (XU Z Y, LIU Z P, YANG B R, et al. A Quick Attribute Reduction Algorithm with Complexity of max(O(|C||U|),O(|C|2·|U/C|)). Chinese Journal of Computers, 2006, 29(3): 391-399.) [16] QIAN Y H, LIANG J Y, PEDRYCZ W, et al. Positive Approximation: An Accelerator for Attribute Reduction in Rough Set Theory. Artificial Intelligence, 2010, 174(9/10): 597-618. [17] 苗夺谦,胡桂荣.知识约简的一种启发式算法.计算机研究与发展, 1999, 36(6): 681-684. (MIAO D Q, HU G R.A Heuristic Algorithm for Reduction of Knowledge. Journal of Computer Research and Development, 1999, 36(6): 681-684.) [18] 王国胤,于洪,杨大春.基于条件信息熵的决策表约简.计算机学报, 2002, 25(7): 759-766. (WANG G Y, YU H, YANG D C.Decision Table Reduction Based on Conditional Information Entropy. Chinese Journal of Computers, 2002, 25(7): 759-766.) [19] SKOWRON A, RAUSZER C.The Discernibility Matrices and Fun-ctions in Information System// SLOWINSKI R, ed. Intelligent De-cision Support. Berlin, Germany: Springer, 1992: 331-362. [20] QIAN J, MIAO D Q, ZHANG Z H, et al. Hybrid Approaches to Attribute Reduction Based on Indiscernibility and Discernibility Relation. International Journal of Approximate Reasoning, 2011, 52(2): 212-230. [21] LIANG J Y, WANG F, DANG C Y, et al. An Efficient Rough Feature Selection Algorithm with a Multi-granulation View. International Journal of Approximate Reasoning, 2012, 53(6): 912-926. [22] LI M J, YU X M, RYU K H.MapReduce-Based Web Mining for Prediction of Web-User Navigation. Journal of Information Science, 2014, 40(5): 557-567. [23] ZHANG J B, LI T R, PAN Y.Parallel Large-Scale Attribute Reduction on Cloud Systems[C/OL]. [2022-08-20].https://arxiv.org/pdf/1610.01807.pdf. [24] RAZA M S, QAMAR U.A Parallel Rough Set Based Dependency Calculation Method for Efficient Feature Selection. Applied Soft Computing, 2018, 71: 1020-1034. [25] QIAN J, MIAO D Q, ZHANG Z H, et al. Parallel Attribute Reduction Algorithms Using MapReduce. Information Sciences, 2014, 279: 671-690. [26] QIAN J, XIA M, YUE X D.Parallel Knowledge Acquisition Algorithms for Big Data Using MapReduce. International Journal of Machine Learning and Cybernetics, 2018, 9(6): 1007-1021. [27] 王磊,李天瑞.一种基于矩阵的知识粒度计算方法.模式识别与人工智能, 2013, 26(5): 447-453. (WANG L, LI T R.A Matrix-Based Approach for Calculation of Knowledge Granulation. Pattern Recognition and Artificial Intelligence, 2013, 26(5): 447-453.) [28] 徐久成,史进玲,孙林.一种基于相对粒度的决策表约简算法. 计算机科学, 2009, 36(3): 205-207. (XU J C, SHI J L, SUN L.Attribute Reduction Algorithm Based on Relative Granularity in Decision Tables. Computer Science, 2009, 36(3): 205-207.) [29] SHU W H, SHEN H.Incremental Feature Selection Based on Rough Set in Dynamic Incomplete Data. Pattern Recognition, 2014, 47(12): 3890-3906. [30] LIU D, LI T R, ZHANG J B.A Rough Set-Based Incremental Approach for Learning Knowledge in Dynamic Incomplete Information Systems. International Journal of Approximate Reasoning, 2014, 55(8): 1764-1786. [31] JING Y G, LI T R, LUO C, et al. An Incremental Approach for Attribute Reduction Based on Knowledge Granularity. Knowledge-Based Systems, 2016, 104: 24-38. [32] 钱进,苗夺谦,张泽华.云计算环境下知识约简算法.计算机学报, 2011, 34(12): 2332-2343. (QIAN J, MIAO D Q, ZHANG Z H.Knowledge Reduction Algorithms in Cloud Computing. Chinese Journal of Computers, 2011, 34(12): 2332-2343.) [33] LIANG J Y, WANG F, DANG C Y, et al. A Group Incremental Approach to Feature Selection Applying Rough Set Technique. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2): 294-308. [34] CHEN Y, LIU K Y, SONG J J, et al. Attribute Group for Attri-bute Reduction. Information Sciences, 2020, 535: 64-80. [35] PAWLAK Z, SKOWRON A.Rudiments of Rough Sets. Information Sciences, 2007, 177(1): 3-27. [36] DEAN J, GHEMAWAT S.MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008, 51(1): 107-113. [37] 吴信东,嵇圣硙.MapReduce与Spark用于大数据分析之比较.软件学报, 2018, 29(6): 1770-1791. (WU X D, JI S W.Comparative Study on MapReduce and Spark for Big Data Analytics. Journal of Software, 2018, 29(6): 1770-1791.) [38] YIN L Z, QIN L Y, JIANG Z H, et al. A Fast Parallel Attribute Reduction Algorithm Using Apache Spark. Knowledge-Based Systems, 2021, 212. DOI: 10.1016/j.knosys.2020.106582. [39] KARUN A K, CHITHARANJAN K.A Review on Hadoop-HDFS Infrastructure Extensions// Proc of the IEEE Conference on Information and Communication Technologies. Washington, USA: IEEE, 2013: 132-137. [40] KODINARIYA T M, MAKWANA P R.Review on Determining Number of Cluster in K-means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 2013, 1(6): 90-95. [41] PAWLAK Z.Rough Sets: Theoretical Aspects of Reasoning about Data. Berlin, Germany: Springer, 1991. [42] SHU W H, QIAN W B.An Incremental Approach to Attribute Reduction from Dynamic Incomplete Decision Systems in Rough Set Theory. Data and Knowledge Engineering, 2015, 100: 116-132. [43] WEI W, SONG P, LIANG J Y, et al. Accelerating Incremental Attribute Reduction Algorithm by Compacting a Decision Table. International Journal of Machine Learning and Cybernetics, 2019, 10(9): 2355-2373.