Abstract: Labeling every instance in a real streaming environment is impractical because labeled data is costly to acquire, yet labeling only part of the instances leads to model instability. To address these problems, a clustering-assumption-based classification algorithm for stream data (CASD) is proposed. The clustering assumption is that instances assigned to the same cluster are likely to come from the same class. Based on this assumption, the clustering result is used to fit the distribution of each class. Instances that are difficult to classify or that come from a class undergoing concept drift are selected to update the current model. Another innovation of the proposed algorithm is that it maintains several base learners for each class and updates them dynamically. When instances from a specific class disappear or reappear, the corresponding base learners are frozen or activated, respectively, rather than relearning prior knowledge. Experimental results show that, with only a few labeled instances, the accuracy of CASD is comparable to that of state-of-the-art algorithms and the model adapts to concept drift rapidly.
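The freeze/activate behavior described in the abstract can be illustrated with a minimal sketch. This is not the CASD implementation: the class name `ClassLearnerPool`, the `patience` threshold, and the use of stored centroids with nearest-centroid prediction as stand-ins for the per-class base learners are all assumptions made for illustration only.

```python
import math


class ClassLearnerPool:
    """Toy per-class pool: each class keeps a few centroids as its base learners.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(self, patience=2):
        self.centroids = {}       # label -> list of centroid vectors
        self.idle = {}            # label -> consecutive batches with no labeled instance
        self.frozen = set()       # labels whose learners are currently frozen
        self.patience = patience  # assumed number of idle batches before freezing

    def update(self, labeled_batch):
        """labeled_batch: list of (vector, label) pairs for the few labeled instances."""
        seen = set()
        for x, y in labeled_batch:
            seen.add(y)
            self.centroids.setdefault(y, []).append(list(x))
            self.frozen.discard(y)  # class reappeared: reactivate, old learners are kept
        for y in self.centroids:
            self.idle[y] = 0 if y in seen else self.idle.get(y, 0) + 1
            if self.idle[y] >= self.patience:
                self.frozen.add(y)  # class disappeared: freeze instead of deleting

    def predict(self, x):
        """Return the label of the nearest centroid among the active classes."""
        best_label, best_dist = None, math.inf
        for y, cs in self.centroids.items():
            if y in self.frozen:
                continue
            for c in cs:
                d = math.dist(x, c)
                if d < best_dist:
                    best_label, best_dist = y, d
        return best_label
```

In this sketch a frozen class keeps its centroids, so when a labeled instance of that class reappears, the earlier knowledge is available again immediately and nothing is relearned from scratch.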