Online Streaming Feature Selection for High-Dimensional and Class-Imbalanced Data Based on Max-Decision Boundary
LIN Yaojin 1,2, CHEN Xiangyan 1,2, BAI Shengxing 1,2, WANG Chenxi 1,2
1. School of Computer Science and Engineering, Minnan Normal University, Zhangzhou 363000
2. Key Laboratory of Data Science and Intelligence Application, The Education Department of Fujian Province, Minnan Normal University, Zhangzhou 363000
Abstract The feature space of data can change dynamically over time, while the number of features in training data is high and fixed, and the label space is class-imbalanced. Motivated by these issues, an online streaming feature selection algorithm for high-dimensional and class-imbalanced data based on the max-decision boundary is proposed. Based on the neighborhood rough set, an adaptive neighborhood relation is defined that accounts for the effect of boundary samples, and a rough dependency formula with respect to the max-decision boundary is then designed. In addition, three online feature subset evaluation metrics are proposed to select features with strong discriminability for both majority and minority classes. Experiments on eleven high-dimensional and class-imbalanced datasets indicate that the proposed method outperforms several state-of-the-art online streaming feature selection algorithms.
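As background for the rough dependency mentioned in the abstract, the following is a minimal sketch of the classical neighborhood rough set dependency with a fixed radius. It is not the adaptive neighborhood relation or the max-decision boundary formula proposed in the paper, which the abstract does not specify. The function name `neighborhood_dependency`, the fixed radius `delta`, and the toy data are assumptions made for illustration only.

```python
import math


def neighborhood_dependency(X, y, features, delta):
    """Classical neighborhood rough set dependency (illustrative sketch).

    Returns the fraction of samples whose delta-neighborhood, measured by
    Euclidean distance over the chosen feature indices, contains only
    samples with the same label. A fixed radius `delta` is an assumption;
    the paper uses an adaptive neighborhood instead.
    """
    n = len(X)
    if n == 0 or not features:
        return 0.0
    consistent = 0
    for i in range(n):
        pure = True
        for j in range(n):
            dist = math.sqrt(sum((X[i][f] - X[j][f]) ** 2 for f in features))
            # A differently labeled neighbor places sample i in the boundary region.
            if dist <= delta and y[j] != y[i]:
                pure = False
                break
        if pure:
            consistent += 1
    return consistent / n


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.0], [0.1, 0.9], [0.9, 0.1], [1.0, 1.0]]
y = [0, 0, 1, 1]

dep_f0 = neighborhood_dependency(X, y, [0], delta=0.2)
dep_f1 = neighborhood_dependency(X, y, [1], delta=0.2)
print(dep_f0, dep_f1)
```

In a streaming setting, a score of this kind could be recomputed each time a new feature arrives, keeping the feature only if it raises the dependency of the selected subset. That acceptance rule is a common simplification and is not taken from the paper.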
Received: 01 July 2020
Fund: National Natural Science Foundation of China (No. 61672272), Natural Science Foundation of Fujian Province (No. 2018J01548, 2018J01547), Science and Technology Project of the Education Department of Fujian Province (No. JAT180318)
Corresponding Author: LIN Yaojin, Ph.D., professor. His research interests include data mining and machine learning.
About the Authors: CHEN Xiangyan, master student. His research interests include data mining. BAI Shengxing, master student. His research interests include data mining. WANG Chenxi, master, lecturer. Her research interests include data mining.