Random Projection Based Clustering Method of Parallel Data Streams
CHEN Hua-Hui1,2, SHI Bo-Le1
1.Department of Computing and Information Technology, Fudan University, Shanghai 200433 2.School of Information Science and Engineering, Ningbo University, Ningbo 315211
Abstract:A synopsis is maintained dynamically for each data stream. The construction of the synopsis is based on random projections and it utilizes the amnesic feature of data stream. Using the synopsis, the approximate distances between streams and the cluster center can be computed fast. And an efficient online version of the classical K-means clustering algorithm is developed. The experimental results show the method can be performed effectively with a good clustering quality.
陈华辉,施伯乐. 基于随机投影的并行数据流聚类方法*[J]. 模式识别与人工智能, 2009, 22(1): 113-122.
CHEN Hua-Hui, SHI Bo-Le. Random Projection Based Clustering Method of Parallel Data Streams. , 2009, 22(1): 113-122.
[1] Keogh E, Kasetty S. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery, 2003, 7(4): 349-371 [2] Guha S, Meyerson A, Mishra N, et al. Clustering Data Streams: Theory and Practice. IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 515-528 [3] Aggarwal C C, Han Jiawei, Wang Jianyong, et al. A Framework for Clustering Evolving Data Streams //Proc of the 29th International Conference on Very Large Data Base. Berlin, Germany, 2003: 81-92 [4] Charikar M, O'Callaghan L, Panigrahy R. Better Streaming Algorithms for Clustering Problems // Proc of the 35th Annual ACM Symposium on Theory of Computing. San Diego, USA, 2003: 30-39 [5] Beringer J, Hullermeier E. Online Clustering of Parallel Data Streams. Data & Knowledge Engineering, 2006, 58(2): 180-204 [6] Yeh M Y, Dai Biru, Chen M S. Clustering over Multiple Evolving Streams by Events and Correlations. IEEE Trans on Knowledge and Data Engineering, 2007, 19(10): 1349-1362 [7] Johnson W B, Lindenstrauss J. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 1984, 26(1): 189-206 [8] Achlioptas D. Database-Friendly Random Projections //Proc of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Santa Barbara, USA, 2001: 274-281 [9] Linial N, London E, Rabinovich Y. The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica, 1995, 15(2): 215-245 [10] Dasgupta S, Gupta A. An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 2003, 22(1): 60-65 [11] Cormode G, Datar M, Indyk P, et al. Comparing Data Streams Using Hamming Norms (How to Zero in). IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 529-540 [12] Indyk P. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation. Journal of the ACM, 2006, 53(3): 307-323 [13] Indyk P, Koudas N, Muthukrishnan S. Identifying Representative Trends in Massive Time Series Data Sets Using Sketches //Proc of the 26th International Conference on Very Large Data Bases. Cairo, Egypt, 2000: 363-372 [14] Fern X Z, Brodly C E. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach //Proc of the 20th International Conference on Machine Learning. Washington, USA, 2003: 186-193 [15] Gilbert A C, Kotidis Y, Muthukrishnan S, et al. One-Pass Wavelet Decompositions of Data Streams. IEEE Trans on Knowledge and Data Engineering, 2003, 15(3): 541-554 [16] Thaper N, Guha S, Indyk P, et al. Dynamic Multidimensional Histograms //Proc of the ACM SIGMOD International Conference on Management of Data. Madison, USA, 2002: 428-439 [17] Cormode G, Muthukrishnan S. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms, 2005, 55(1): 58-75 [18] Aggarwal C C, Yu P S. Data Streams: Models and Algorithms. Berlin, Germany: Springer, 2007 [19] Palpanas T, Vlachos M, Keogh E, et al. Online Amnesic Approximation of Streaming Time Series //Proc of the 20th International Conference on Data Engineering. Boston, USA, 2004: 339-349 [20] Zhao Yanchang, Zhang Shichao. Generalized Dimension-Reduction Framework for Recent-Biased Time Series Analysis. IEEE Trans on Knowledge and Data Engineering, 2006, 18(2): 231-244 [21] Bulut A, Singh A K. SWAT: Hierarchical Stream Summarization in Large Networks //Proc of the 19th International Conference on Data Engineering. Vienna, Austria, 2003: 303-314 [22] Potamias M, Patroumpas K, Sellis T. Amnesic Online Synopses for Moving Objects //Proc of the 15th ACM International Conference on Information and Knowledge Management. Arlington, USA, 2006: 784-785 [23] Yu P S, Wang Jianyong, Aggarwal C C, et al. A Framework for On-Demand Classification of Evolving Data Streams. IEEE Trans on Knowledge and Data Engineering, 2006, 18(5): 577-589 [24] Lerman I, Costa J, Silva H. Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion // Tajuga K, Sokolowski A, Bock H H, eds. Classification, Clustering and Data Analysis: Recent Advances and Application. Berlin, Germany: Springer, 2002: 147-157 [25] Gavrilov M, Anguelov D, Indyk P, et al. Mining the Stock Market: Which Measure Is Best? //Proc of the 6th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Boston, USA, 2000: 487-496 [26] Keogh E, Xi Xiaopeng, Wei Li, et al. The UCR Time Series Classification/Clustering Homepage [DB/OL]. [2007-08-01]. http:// www.cs.ucr.edu/~eamonn/time_series_data