Semi-Supervised Learning Based Ensemble Classifier for Stream Data
XU Wen-Hua 1, QIN Zheng 1, 2, CHANG Yang 2
1. Department of Computer Science and Technology, School of Information Science and Technology, Tsinghua University, Beijing 100084
2. School of Software, School of Information Science and Technology, Tsinghua University, Beijing 100084
Abstract: Stream data classification algorithms are mainly based on supervised learning and need large amounts of labeled data for training. Such approaches are impractical because labeled data are costly to acquire in a real streaming environment. This paper presents SEClass, a semi-supervised learning based ensemble classifier for stream data classification. SEClass trains an ensemble classifier from a small number of labeled instances together with a large number of unlabeled instances, and classifies unlabeled instances by majority voting. Experimental results show that the accuracy of SEClass is on average 5.33% higher than that of the state-of-the-art supervised method when both are trained with the same number of labeled data. SEClass is also suitable for classifying high-dimensional, high-speed, massive stream data.
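The majority voting step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the SEClass implementation: the abstract does not specify the base learners, how they are trained, or how ties are broken, so the `majority_vote` function, the toy threshold rules in `ensemble`, and the first-seen tie-breaking are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# A classifier is modelled here as any callable mapping a feature vector to a label.
Classifier = Callable[[Sequence[float]], Hashable]


def majority_vote(classifiers: Sequence[Classifier], x: Sequence[float]) -> Hashable:
    """Return the label predicted by the largest number of ensemble members."""
    if not classifiers:
        raise ValueError("ensemble is empty")
    votes = Counter(clf(x) for clf in classifiers)
    # most_common(1) yields the label with the highest vote count. Ties fall
    # back to first-encountered order, an arbitrary choice made in this sketch.
    label, _ = votes.most_common(1)[0]
    return label


# Hypothetical toy ensemble: three threshold rules on the first feature.
ensemble = [
    lambda x: "pos" if x[0] > 0.2 else "neg",
    lambda x: "pos" if x[0] > 0.5 else "neg",
    lambda x: "pos" if x[0] > 0.8 else "neg",
]

print(majority_vote(ensemble, [0.6]))  # two of the three rules vote "pos"
```

In a streaming setting the members of `ensemble` would be replaced or retrained as new data chunks arrive; the voting step itself stays the same.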