基于随机投影的并行数据流聚类方法<sup>*</sup>

摘要
图/表
参考文献
相关文章 (10)

全文: PDF (492 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要利用数据流的遗忘特性，应用随机投影，分层、动态地维护每个数据流的概要结构.基于该概要结构，快速计算数据流和聚类中心之间的近似距离，实现一种适合并行多数据流的K-means聚类方法.所进行的实验验证该方法的有效性.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	陈华辉
	施伯乐

关键词 ：概要结构, 遗忘特性, 随机投影, 数据流

Abstract：A synopsis is maintained dynamically for each data stream. The construction of the synopsis is based on random projections and it utilizes the amnesic feature of data stream. Using the synopsis, the approximate distances between streams and the cluster center can be computed fast. And an efficient online version of the classical K-means clustering algorithm is developed. The experimental results show the method can be performed effectively with a good clustering quality.

Key words： Synopsis Amnesic Feature Random Projection Data Stream

收稿日期: 2007-10-18

ZTFLH:

TP391

基金资助:国家自然科学基金项目(No.60773072)、浙江省自然科学基金项目(No.Y104144)和浙江省教育厅项目(No.20051737)资助

作者简介: 陈华辉，男，1964年生，博士研究生，副教授，主要研究方向为数据流、数据挖掘.E-mail: chenhuahui@nbu.edu.cn.施伯乐，男，1936年生，教授，博士生导师，主要研究方向为数据库与知识库.

引用本文:

陈华辉，施伯乐. 基于随机投影的并行数据流聚类方法^*[J]. 模式识别与人工智能, 2009, 22(1): 113-122. CHEN Hua-Hui, SHI Bo-Le. Random Projection Based Clustering Method of Parallel Data Streams. , 2009, 22(1): 113-122.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/ 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2009/V22/I1/113

[1] Keogh E, Kasetty S. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery, 2003, 7(4): 349-371
[2] Guha S, Meyerson A, Mishra N, et al. Clustering Data Streams: Theory and Practice. IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 515-528
[3] Aggarwal C C, Han Jiawei, Wang Jianyong, et al. A Framework for Clustering Evolving Data Streams //Proc of the 29th International Conference on Very Large Data Base. Berlin, Germany, 2003: 81-92
[4] Charikar M, O'Callaghan L, Panigrahy R. Better Streaming Algorithms for Clustering Problems // Proc of the 35th Annual ACM Symposium on Theory of Computing. San Diego, USA, 2003: 30-39
[5] Beringer J, Hullermeier E. Online Clustering of Parallel Data Streams. Data & Knowledge Engineering, 2006, 58(2): 180-204
[6] Yeh M Y, Dai Biru, Chen M S. Clustering over Multiple Evolving Streams by Events and Correlations. IEEE Trans on Knowledge and Data Engineering, 2007, 19(10): 1349-1362
[7] Johnson W B, Lindenstrauss J. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 1984, 26(1): 189-206
[8] Achlioptas D. Database-Friendly Random Projections //Proc of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Santa Barbara, USA, 2001: 274-281
[9] Linial N, London E, Rabinovich Y. The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica, 1995, 15(2): 215-245
[10] Dasgupta S, Gupta A. An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 2003, 22(1): 60-65
[11] Cormode G, Datar M, Indyk P, et al. Comparing Data Streams Using Hamming Norms (How to Zero in). IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 529-540
[12] Indyk P. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation. Journal of the ACM, 2006, 53(3): 307-323
[13] Indyk P, Koudas N, Muthukrishnan S. Identifying Representative Trends in Massive Time Series Data Sets Using Sketches //Proc of the 26th International Conference on Very Large Data Bases. Cairo, Egypt, 2000: 363-372
[14] Fern X Z, Brodly C E. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach //Proc of the 20th International Conference on Machine Learning. Washington, USA, 2003: 186-193
[15] Gilbert A C, Kotidis Y, Muthukrishnan S, et al. One-Pass Wavelet Decompositions of Data Streams. IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 541-554
[16] Thaper N, Guha S, Indyk P, et al. Dynamic Multidimensional Histograms //Proc of the ACM SIGMOD International Conference on Management of Data. Madison, USA, 2002: 428-439
[17] Cormode G, Muthukrishnan S. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms, 2005, 55(1): 58-75
[18] Aggarwal C C, Yu P S. Data Streams: Models and Algorithms. Berlin, Germany: Springer, 2007
[19] Palpanas T, Vlachos M, Keogh E, et al. Online Amnesic Approximation of Streaming Time Series //Proc of the 20th International Conference on Data Engineering. Boston, USA, 2004: 339-349
[20] Zhao Yanchang, Zhang Shichao. Generalized Dimension-Reduction Framework for Recent-Biased Time Series Analysis. IEEE Trans on Knowledge and Data Engineering, 2006, 18(2): 231-244
[21] Bulut A, Singh A K. SWAT: Hierarchical Stream Summarization in Large Networks //Proc of the 19th International Conference on Data Engineering. Vienna, Austria, 2003: 303-314
[22] Potamias M, Patroumpas K, Sellis T. Amnesic Online Synopses for Moving Objects //Proc of the 15th ACM International Conference on Information and Knowledge Management. Arlington, USA, 2006: 784-785
[23] Yu P S, Wang Jianyong, Aggarwal C C, et al. A Framework for On-Demand Classification of Evolving Data Streams. IEEE Trans on Knowledge and Data Engineering, 2006, 18(5): 577-589
[24] Lerman I, Costa J, Silva H. Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion // Tajuga K, Sokolowski A, Bock H H, eds. Classification, Clustering and Data Analysis: Recent Advances and Application. Berlin, Germany: Springer, 2002: 147-157
[25] Gavrilov M, Anguelov D, Indyk P, et al. Mining the Stock Market: Which Measure Is Best? //Proc of the 6th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Boston, USA, 2000: 487-496
[26] Keogh E, Xi Xiaopeng, Wei Li, et al. The UCR Time Series Classification/Clustering Homepage [DB/OL]. [2007-08-01]. http:// www.cs.ucr.edu/~eamonn/time_series_data