Document Clustering Based on Constrained Principal Component Analysis
WANG Ming-Wen1,YE Hao2,ZUO Jia-Li3
1.School of Computer and Information Engineering,Jiangxi Normal University,Nanchang 330022 2.School of Computer Science,Fudan University,Shanghai 201203 3.School of Elementary Education,Jiangxi Normal University,Nanchang 330027
Abstract:Principal component analysis is an effective method to improve the performance of clustering in high dimension. On the other hand,principal component analysis is easy to lose the components which benefits for clustering. In order to preserve these beneficial components,an iteration algorithm of dimensionality reduction and clustering,named constrained principal component clustering,is proposed. Each iteration step can be represented as a constrained optimization problem which has a analytical solution. This iterative clustering algorithm is called document clustering based on constrained principal component analysis. The experimental results on Reuter21578 and WebKB show that the proposed algorithm outperforms to k-means,Non-Negative Matrix Decomposition and Spectral Clustering.
[1] Jain A K,Murty M N,Flynn P J. Data Clustering: A Review. ACM Computing Surveys (CSUR),1999,31(3): 264-323 [2] Ding Shifei,Shi Zhongzhi,Jin Fengxiang,et al. A Direct Clustering Algorithm Based on Generalized Information Distance. Journal of Computer Research and Development,2007,44(4): 674-679 (in Chinese) (丁世飞,史忠植,靳奉祥,等.基于广义信息距离的直接聚类算法.计算机研究与发展,2007,44(4): 674-679) [3] Li Yujian. An Adaptive k-means Clustering Algorithm. Journal of Computer Research and Development,2007,44(22): 100-104 (in Chinese) (李玉鑑.自适应k-均值聚类算法.计算机研究与发展,2007,44(22): 100-104) [4] Bishop C M. Pattern Recognition and Machine Learning. New York,USA: Springer-Verlag,2006 [5] Xu Wei,Liu Xin,Gong Yihong. Document Clustering Based on Non-negative Matrix Factorization // Proc of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Toronto,Canada,2003: 267-273 [6] Seung D,Lee L. Algorithms for Non-Negative Matrix Factorization // Dietterich T G,Becker S,Ghahramani Z,eds. Advances in Neural Information Processing Systems. Cambridge,USA: MIT Press,2001,XIV: 556-562 [7] Wang Guodong. Similarity Matrix and Spectral Clustering. Master Dissertation. Beijing,China: Beijing Jiaotong University,2009 (in Chinese) (王国栋.相似矩阵与谱聚类.硕士学位论文.北京:北京交通大学,2009) [8] Cai Deng,He Xiaofei,Han Jiawei. Document Clustering Using Locality Preserving Indexing. IEEE Trans on Knowledge and Data Engineering,2005,17(12): 1624-1637 [9] Bach F R,Jordan M I. Learning Spectral Clustering with Application to Speech Separation. Journal of Machine Learning Research,2006,7: 1963-2001 [10] Duda R O,Hart P E,Stork D G. Pattern Classification. 2nd Edition. New York,USA: John Wiley Sons,2000 [11] Ding Chris,He Xiaofeng. k-means Clustering via Principal Component Analysis // Proc of the 21st International Conference on Machine learning. Alberta,Canada,2004: 29 [12] Wang Mingwen,Fu Jianbo,Luo Yuansheng,et al. Two-Stage Text Clustering Based on Collaborative Clustering. Pattern Recognition and Artificial Intelligence,2009,22(6): 848-853 (in Chinese) (王明文,付剑波,罗远胜,等.基于协同聚类的两阶段文本聚类方法.模式识别与人工智能,2009,22(6): 848-853) [13] Vichi M,Saporta G. Clustering and Disjoint Principal Component Analysis. Computational Statistics Data Analysis,2009,53(8): 3194-3208