Clustering Assumption Based Classification Algorithm for Stream Data<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201701001


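Abstract: Obtaining the true class labels of instances in a data stream is costly, so labeling every instance is impractical, while labeling a random subset of the instances makes the model unstable. To address these problems, this paper proposes a clustering assumption based classification algorithm for data streams. The clustering assumption states that instances assigned to the same cluster by a clustering algorithm are likely to share the same class. Based on this assumption, the clustering result on the training dataset is used to fit the distribution of the instances, and during classification the instances that are hard to classify or that indicate potential concept drift are deliberately selected to update the model. A separate base classifier is built for the instances of each class in the training dataset. When a class disappears from or reappears in the data stream, only the corresponding base classifier needs to be frozen or activated, and previously acquired knowledge does not have to be relearned. Experiments show that, while adapting to concept drift, the proposed algorithm reduces the number of instances needed to update the model and achieves classification performance comparable to or better than that of current data stream classification algorithms.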


Keywords: concept drift, data stream, classification, clustering

Abstract: Labeling all the instances is impractical due to the high cost of acquiring labeled data in a real streaming environment. However, labeling only a random part of the instances leads to model instability. To address these problems, a clustering assumption based classification algorithm for stream data (CASD) is proposed. It is assumed that instances assigned to the same cluster may come from the same class. Based on this clustering assumption, the clustering result is used to fit the distribution of each class. Instances that are difficult to classify or that come from a class with potential concept drift are selected to update the current model. Maintaining base learners for each class and dynamically updating them is another innovation of the proposed algorithm. When instances from a specific class disappear or reappear, the corresponding base learners are frozen or activated instead of relearning the prior knowledge. Experimental results show that, with only a few labeled instances, the accuracy of CASD is comparable to that of state-of-the-art algorithms and the model adapts to concept drift rapidly.
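The freeze/activate idea and the selective labeling described in the abstract can be illustrated with a minimal sketch. This is not the CASD implementation from the paper: the running-centroid learner stands in for whatever base classifier the authors use, and the names `ClassModel`, `StreamSketch`, and the `margin` parameter are hypothetical choices made only for illustration.

```python
import math


class ClassModel:
    """Hypothetical per-class base learner: a running centroid that can be frozen."""

    def __init__(self, label, dim):
        self.label = label
        self.centroid = [0.0] * dim
        self.count = 0
        self.active = True

    def update(self, x):
        # Incremental mean of all instances seen for this class.
        self.count += 1
        self.centroid = [c + (xi - c) / self.count
                         for c, xi in zip(self.centroid, x)]

    def distance(self, x):
        return math.dist(self.centroid, x)


class StreamSketch:
    """One learner per class; frozen learners are skipped but keep their state."""

    def __init__(self, dim, margin=0.2):
        self.dim = dim
        self.margin = margin  # assumed relative-gap threshold for "hard" instances
        self.models = {}

    def learn(self, x, label):
        if label not in self.models:
            self.models[label] = ClassModel(label, self.dim)
        model = self.models[label]
        model.active = True  # a reappearing class reactivates its learner
        model.update(x)

    def freeze(self, label):
        # The class has disappeared: stop using its learner, keep what it learned.
        self.models[label].active = False

    def predict(self, x):
        """Return (predicted_label, needs_label) using only active learners."""
        ranked = sorted((m.distance(x), m.label)
                        for m in self.models.values() if m.active)
        if not ranked:
            return None, True
        best_dist, best_label = ranked[0]
        if len(ranked) == 1:
            return best_label, False
        second_dist = ranked[1][0]
        # Request a true label only when the two nearest classes are nearly tied.
        ambiguous = (second_dist - best_dist) <= self.margin * second_dist
        return best_label, ambiguous


clf = StreamSketch(dim=2)
clf.learn([0.0, 0.0], "a")
clf.learn([10.0, 10.0], "b")
easy = clf.predict([1.0, 1.0])   # clearly closer to class "a"
hard = clf.predict([5.0, 5.1])   # nearly equidistant, so a label is requested
clf.freeze("b")                  # class "b" disappears from the stream
frozen = clf.predict([9.0, 9.0])  # only "a" is considered now
```

Under these assumptions, only instances flagged with `needs_label` would be sent for labeling and passed back to `learn`, which is the sense in which the number of labeled instances is kept small.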

Key words: Concept Drift, Stream Data, Classification, Clustering

Received: 2016-05-30

Chinese Library Classification (ZTFLH): TP 311

Funding: Supported by the Natural Science Foundation of Fujian Province (No. 2016J01280)

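About the author: LI Nan, male, born in 1987, master's degree, teaching assistant. His main research interests include pattern recognition and artificial intelligence. E-mail: binbanbiniban@163.com.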

Cite this article:

LI Nan. Clustering Assumption Based Classification Algorithm for Stream Data[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(1): 1-10.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201701001 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2017/V30/I1/1
