Document Cluster Ensemble Algorithms Based on Matrix Spectral Analysis<sup>*</sup>


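Abstract: Cluster ensemble techniques can effectively improve the accuracy and stability of a single clustering algorithm. The key problem is how to combine the different cluster members (base clusterings) into a better clustering result. This paper introduces spectral clustering to solve the document cluster ensemble problem and designs a normalized Laplacian matrix-based spectral algorithm (NLMSA). Based on an algebraic transformation, NLMSA obtains the eigenvectors of the normalized Laplacian matrix indirectly by computing the eigenvalues and eigenvectors of a small matrix, and uses them for the subsequent clustering. The key idea of spectral clustering is then investigated further, and a hyperedge transition probability matrix-based spectral algorithm (HTMSA) is designed. HTMSA obtains the low-dimensional embedding of the documents indirectly by solving for the low-dimensional embedding of the hyperedges, and passes it to the subsequent K-means algorithm. Experimental results on the TREC and Reuters document sets verify the effectiveness of NLMSA and HTMSA: both obtain better results than other ensemble algorithms based on graph partitioning. The results of HTMSA are slightly worse than those of NLMSA, while its time and space requirements are much lower.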
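The small-matrix idea behind NLMSA can be illustrated with a minimal sketch. The abstract does not give the construction, so everything here is an assumption: the ensemble is encoded as a 0/1 incidence matrix `H` (documents by clusters), the similarity matrix is taken to be `H @ H.T`, and the function names are hypothetical. With `A = D^{-1/2} H`, the eigenvectors of the n x n matrix `A A^T` (whose largest eigenvalues correspond to the smallest eigenvalues of the normalized Laplacian `I - A A^T`) are recovered from the much smaller t x t matrix `A^T A`.

```python
import numpy as np

def incidence_matrix(labelings):
    """Stack the one-hot cluster indicators of every base clustering into H (n x t)."""
    blocks = []
    for labels in labelings:
        labels = np.asarray(labels)
        ids = np.unique(labels)
        blocks.append((labels[:, None] == ids[None, :]).astype(float))
    return np.hstack(blocks)

def spectral_embedding_small(H, k):
    """Top-k eigenpairs of D^{-1/2} (H H^T) D^{-1/2}, computed from a t x t matrix.

    Assumes k does not exceed the rank of H, so the selected eigenvalues are positive.
    """
    d = H @ H.sum(axis=0)                 # row sums of H H^T, without forming H H^T
    A = H / np.sqrt(d)[:, None]           # A = D^{-1/2} H, shape (n, t)
    vals, vecs = np.linalg.eigh(A.T @ A)  # small t x t problem, ascending eigenvalues
    top = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[top], vecs[:, top]
    U = A @ vecs / np.sqrt(vals)          # if A^T A v = s v, then A v / sqrt(s) is a unit eigenvector of A A^T
    return vals, U

if __name__ == "__main__":
    # Three hypothetical base clusterings of six documents.
    labelings = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2], [1, 1, 1, 0, 0, 0]]
    H = incidence_matrix(labelings)
    vals, U = spectral_embedding_small(H, k=2)
    print(vals)
    print(U)
```

The rows of `U` (optionally normalized to unit length) would then be clustered, for example with K-means. The saving comes from t, the total number of clusters in the ensemble, usually being far smaller than the number of documents n.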


Keywords: cluster analysis, cluster ensemble, spectral clustering, document clustering, low-rank matrix approximation

Abstract: Cluster ensemble techniques are effective in improving both the accuracy and the stability of a single clustering algorithm. The critical issue in a cluster ensemble is how to combine multiple clusterings into a superior final clustering result. A spectral clustering algorithm is introduced to solve the document cluster ensemble problem, and a normalized Laplacian matrix-based spectral algorithm (NLMSA) is proposed. Using an algebraic transformation, it computes the eigenvalues and eigenvectors of a small matrix to obtain the eigenvectors of the normalized Laplacian matrix. The key idea of spectral clustering is further investigated, and a hyperedge transition matrix-based spectral algorithm (HTMSA) is proposed. It derives the low-dimensional embeddings of the documents from those of the hyperedges, and the K-means algorithm is then used to cluster the documents according to their embeddings. Experimental results on the TREC and Reuters document sets demonstrate the effectiveness of the proposed algorithms. Both NLMSA and HTMSA outperform other cluster ensemble techniques based on graph partitioning. NLMSA obtains better results than HTMSA, while the computational cost of HTMSA is much lower than that of NLMSA.
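The hyperedge-based route can be sketched in the same spirit. The abstract does not define the transition matrix, so the construction below is only one plausible reading and not the authors' exact formulation: each cluster is a hyperedge, a random walk moves from a hyperedge to one of its documents and then to one of that document's hyperedges, hyperedges are embedded with the leading eigenvectors of that transition matrix, and each document is placed at the mean of the hyperedges containing it. All function names are hypothetical, and the K-means routine is a plain Lloyd iteration with a deterministic start.

```python
import numpy as np

def incidence_matrix(labelings):
    """One 0/1 column per cluster (hyperedge) of every base clustering."""
    cols = []
    for labels in labelings:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            cols.append((labels == c).astype(float))
    return np.column_stack(cols)              # shape (n_docs, n_hyperedges)

def hyperedge_transition_matrix(H):
    """Assumed walk: hyperedge -> random member document -> random hyperedge of it."""
    d_e = H.sum(axis=0)                       # hyperedge sizes
    d_v = H.sum(axis=1)                       # number of hyperedges per document
    return (H / d_e).T @ (H / d_v[:, None])   # D_e^{-1} H^T D_v^{-1} H, rows sum to 1

def embed_documents(H, k):
    """Embed hyperedges via a symmetric t x t matrix similar to P, then average per document."""
    d_e = H.sum(axis=0)
    d_v = H.sum(axis=1)
    B = H / np.sqrt(d_v)[:, None] / np.sqrt(d_e)   # D_v^{-1/2} H D_e^{-1/2}
    vals, W = np.linalg.eigh(B.T @ B)              # B^T B = D_e^{1/2} P D_e^{-1/2}
    W = W[:, np.argsort(vals)[::-1][:k]]
    V = W / np.sqrt(d_e)[:, None]                  # leading eigenvectors of P (hyperedge embedding)
    return (H / d_v[:, None]) @ V                  # document = mean of its hyperedges

def kmeans(X, k, n_iter=50):
    """Lloyd iterations with farthest-point initialisation (deterministic)."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def htmsa_like_consensus(labelings, k):
    H = incidence_matrix(labelings)
    return kmeans(embed_documents(H, k), k)

if __name__ == "__main__":
    # Three hypothetical, partly disagreeing base clusterings of eight documents.
    ensemble = [[0, 0, 0, 0, 1, 1, 1, 1],
                [0, 0, 0, 1, 1, 1, 1, 1],
                [1, 1, 1, 1, 1, 0, 0, 0]]
    print(htmsa_like_consensus(ensemble, k=2))
```

In this sketch only a t x t eigenproblem is solved, and no n x n similarity matrix is ever formed, which is consistent with the lower time and space cost the abstract reports for HTMSA.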

Key words: Clustering Analysis, Cluster Ensemble, Spectral Clustering, Document Clustering, Low Rank Approximation of Matrix

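Received: 2008-09-04

CLC number (ZTFLH): TP391

Funding: Supported by the National Natural Science Foundation of China (No.60603092) and the Doctoral Program Fund of the Ministry of Education of China (No.20070217043)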

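About the authors: XU Sen (徐森), male, born in 1983, Ph.D. candidate. His research interests include artificial intelligence, machine learning, and text mining. E-mail: xusen@hrbeu.edu.cn. LU Zhi-Mao (卢志茂), male, born in 1972, professor and Ph.D. supervisor. His research interests include artificial intelligence, intelligent information processing, and text mining. GU Guo-Chang (顾国昌), male, born in 1946, professor and Ph.D. supervisor. His research interests include artificial intelligence and intelligent robots.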

Cite this article:

XU Sen (徐森), LU Zhi-Mao (卢志茂), GU Guo-Chang (顾国昌). Document Cluster Ensemble Algorithms Based on Matrix Spectral Analysis[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2009, 22(5): 780-786.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2009/V22/I5/780
