Document Clustering Based on Constrained Principal Component Analysis

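Abstract: Reducing the dimensionality of high-dimensional data with principal component analysis (PCA) can effectively improve the performance of clustering algorithms, but it tends to discard some components that contribute to clustering. To retain those components during dimensionality reduction, an iterative algorithm is proposed in which dimensionality reduction and clustering are performed interactively. Each iteration can be expressed as a constrained optimization problem whose analytical solution can be obtained, which yields the corresponding iterative clustering algorithm, called document clustering based on constrained principal component analysis. Experimental results on the Reuters-21578 and WebKB document collections show that the proposed method performs better than k-means clustering, non-negative matrix factorization clustering, and spectral clustering.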

Keywords: constrained principal component analysis, constrained optimization, clustering, iteration

Abstract: Principal component analysis is an effective way to improve clustering performance on high-dimensional data. However, it tends to lose components that benefit clustering. To preserve these beneficial components, an iterative algorithm that interleaves dimensionality reduction and clustering, named constrained principal component clustering, is proposed. Each iteration step can be represented as a constrained optimization problem that has an analytical solution. The resulting iterative clustering algorithm is called document clustering based on constrained principal component analysis. Experimental results on Reuters-21578 and WebKB show that the proposed algorithm outperforms k-means, non-negative matrix factorization, and spectral clustering.

Key words: Constrained Principal Component Analysis, Constrained Optimization, Clustering, Iteration
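
For orientation, the starting point that the abstract contrasts against (reduce dimensionality with PCA first, then run k-means once in the reduced space) can be sketched as follows. This is only a minimal illustration of that baseline, not the constrained method proposed in the article, whose optimization problem is not given on this page. The function names `pca_project`, `kmeans`, and `pca_then_kmeans`, and all parameter defaults, are hypothetical and chosen for the example.

```python
import numpy as np


def pca_project(X, d):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points (assumption: k <= len(X)).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def pca_then_kmeans(X, k, d):
    """Baseline pipeline: one PCA reduction followed by one k-means run."""
    return kmeans(pca_project(np.asarray(X, dtype=float), d), k)
```

Because the projection here is fixed before clustering starts, directions with low variance are dropped even if they would help separate clusters. That is the limitation the abstract says the iterative, constrained formulation is meant to address.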

Received: 2012-02-13

CLC number (ZTFLH): TP391.4

Funding: Supported by the National Natural Science Foundation of China (No. 60963014, 61163006)

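About the authors: WANG Ming-Wen (corresponding author), male, born in 1964, Ph.D., professor. His research interests include information retrieval, text classification, and machine learning. E-mail: mwwang@jxnu.edu.cn. YE Hao, male, born in 1978, lecturer, Ph.D. candidate. His research interests include information retrieval, text classification, and machine learning. ZUO Jia-Li, female, born in 1982, Ph.D., lecturer. Her research interests include information retrieval, text classification, and machine learning.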

Cite this article:

WANG Ming-Wen, YE Hao, ZUO Jia-Li. Document Clustering Based on Constrained Principal Component Analysis [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2013, 26(3): 270-275.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2013/V26/I3/270
