Document Cluster Ensemble Algorithms Based on Matrix Spectral Analysis
XU Sen1, LU Zhi-Mao2, GU Guo-Chang1
1.College of Computer Science and Technology, Harbin Engineering University, Harbin 150001 2.College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001
Abstract:Cluster ensemble techniques are effective in improving both the robustness and the stability of the single clustering algorithm. How to combine multiple clusters to yield a final superior clustering result is critical in cluster ensemble. Spectral clustering algorithm is introduced to solve document cluster ensemble problem. Normalized Laplacian matrix-based spectral algorithm (NLMSA) is proposed. According to algebraic transformation, it computes eigenvalues and eigenvectors of a small matrix to obtain the eigenvectors of normalized Laplacian matrix. The key idea of spectral clustering algorithm is further investigated, and hyperedge transition matrix-based spectral algorithm (HTMSA) is proposed. It attains the low dimensional embeddings of documents by those of hyperedges and then the K-means algorithm is used to cluster according to those embedding results of documents. Experimental results on TREC and Reuters document sets demonstrate the effectiveness of the proposed algorithms. Both NLMSA and HTMSA outperform other cluster ensemble techniques based on graph partitioning. NLMSA obtains better results than HTMSA while the computational cost of HTMSA is much lower than that of NLMSA.
[1] Tan P N, Steinbach M, Kumar V. Introduction to Data Mining. Toronto, Canada: Addison-Wesley Longman, 2005 [2] Strehl A, Ghosh J. Cluster Ensembles—A Knowledge Reuse Framework for Combining Partitionings // Proc of the 11th Conference on Artificial Intelligence. Edmonton, Canada, 2002: 93-98 [3] Fred A L, Jain A K. Combining Multiple Clusterings Using Evidence Accumulation. IEEE Trans on Pattern Analysis and Machine Intelligence, 2005, 27(6): 835-850 [4] Fern X Z, Brodley C E. Solving Cluster Ensemble Problems by Bipartite Graph Partitioning // Proc of the 20th International Conference on Machine Learning. Banff, Canada, 2004: 36-43 [5] Topchy A, Jain A K, Punch W. A Mixture Model for Clustering Ensembles // Proc of the 4th SIAM International Conference on Data Mining. Lake Buena Vista, USA, 2004: 379-390 [6] Ayad H, Basir O A, Kamel M. A Probabilistic Model Using Information Theoretic Measures for Cluster Ensembles // Proc of the 5th International Workshop on Multiple Classifier Systems. Cagliari, Italy, 2004: 144-153 [7] Tang Wei, Zhou Zhihua. Bagging-Based Selective Cluster Ensemble. Journal of Software, 2005, 16(4): 496-502 (in Chinese) (唐 伟,周志华.基于Bagging的选择性聚类集成.软件学报, 2005, 16(4): 496-502) [8] Fern X Z, Lin W. Cluster Ensemble Selection. Statistical Analysis and Data Mining. 2008, 1(3): 128-141 [9] Luo Huilan, Kong Fansheng, Li Yixiao. An Analysis of Diversity Measures in Clustering Ensembles. Chinese Journal of Computers, 2007, 30(8): 1315-1324 (in Chinese) (罗会兰,孔繁胜,李一啸.聚类集成中的差异性度量研究.计算机学报, 2007, 30(8): 1315-1324) [10] Luxburg U V. A Tutorial on Spectral Clustering. Statistics and Computing, 2007, 17(4): 395-416 [11] Shi Jianbo, Malik J. Normalized Cuts and Image Segmentation. IEEE Trans on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905 [12] Ng A Y, Jordan M I, Weiss Y. On Spectral Clustering: Analysis and an Algorithm [EB/OL]. [2008-04-10]. http://books.nips.cc/papers/files/nips14/AA35.pdf [13] Meila M, Shi J. A Random Walks View of Spectral Segmentation // Proc of the 8th International Workshop on Artificial Intelligence and Statistics. Key West, USA, 2001: 31-37 [14] Karypis G, Kumar V. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 1998, 20(1): 359-392 [15] Karypis G, Aggarwal R, Kumar V, et al. Multilevel Hypergraph Partitioning: Applications in VLSI Domain // Proc of the 34th Annual Conference on Design Automation. Anaheim, USA, 1997: 526-529 [16] Dhillon I S. Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning // Proc of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, USA, 2001: 269-274 [17] von Luxburg U, Belkin M, Bousquet O. Consistency of Spectral Clustering. Annals of Statistics, 2008, 36(2): 555-586 [18] Berry M W. Large-Scale Sparse Singular Value Computations. The International Journal of Supercomputer Applications, 1992, 6(1): 13-49 [19] National Institute of Standards and Technology. Text Retrieval Conference [DB/OL]. [2007-11-20]. http://trec.nist.gov [20] Lewis D D. Reuters-21578~1.0 [DB/OL]. [2008-07-10]. http://www.daviddlewis.com/resources/testcouecthons/reaters21578