Semi-supervised Short Text Stream Classification Based on Vector Representation and Label Propagation
WANG Haiyan¹, HU Xuegang¹,², LI Peipei¹,²
1. School of Computer and Information, Hefei University of Technology, Hefei 230601
2. Anhui Province Key Laboratory of Industry Safety and Emergency Technology, Hefei University of Technology, Hefei 230009
Abstract: Short text streams produced by social networks arrive quickly and in high volume, and they are characterized by concept drift, short text length and massive amounts of unlabeled data. Therefore, a semi-supervised short text stream classification algorithm based on vector representation and label propagation is proposed in this paper to classify short text streams with only a small amount of labeled data. In addition, a cluster-based concept drift detection algorithm is proposed to adapt to concept drift. Experimental results on real short text streams show that the proposed algorithm achieves higher classification accuracy and macro average than classical semi-supervised classification algorithms and semi-supervised data stream classification algorithms, and that it adapts quickly to concept drift in the data stream.
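As a rough illustration of the label propagation idea named in the abstract, the following Python sketch spreads a few known labels over precomputed text vectors. It is not the authors' algorithm: the Gaussian affinity, the `sigma` and `n_iter` parameters, the function name `propagate_labels` and the toy vectors are all assumptions made for this example, and the vector representation step is assumed to have been done already.

```python
import numpy as np

def propagate_labels(X, y, n_classes, sigma=1.0, n_iter=100):
    """Generic iterative label propagation (illustrative, not the paper's method).

    X: (n, d) array of text vectors, assumed precomputed.
    y: (n,) integer labels, with -1 marking unlabeled texts.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    idx = np.flatnonzero(y >= 0)  # positions of the labeled texts

    # Gaussian affinity between every pair of vectors (an assumed choice).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Row-normalise the affinities into a transition matrix.
    T = W / W.sum(axis=1, keepdims=True)

    # One-hot label matrix; unlabeled rows start at zero.
    F = np.zeros((len(y), n_classes))
    F[idx, y[idx]] = 1.0
    clamp = F[idx].copy()

    for _ in range(n_iter):
        F = T @ F        # each text takes labels from its neighbours
        F[idx] = clamp   # known labels stay fixed
    return F.argmax(axis=1)

# Toy data: two well-separated groups with one labeled text each.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y = [0, -1, -1, 1, -1, -1]
pred = propagate_labels(X, y, n_classes=2)
print(pred)
```

In a streaming setting, a routine of this kind would be applied chunk by chunk; how the chunks are formed and how drift is handled is left to the paper itself.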
WANG Haiyan, HU Xuegang, LI Peipei. Semi-supervised Short Text Stream Classification Based on Vector Representation and Label Propagation [J]. Pattern Recognition and Artificial Intelligence, 2018, 31(7): 634-642.