A Generalized Method for Unsupervised Text Clustering Using Finite Mixture Models<sup>*</sup>


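Abstract: A generalized method for unsupervised text clustering with finite mixture models is proposed. The relevance of each feature to each mixture component is introduced into the mixture model as a latent variable, so that model selection, feature selection and parameter estimation of the mixture model are carried out within one unified framework. Experimental results on large-scale text datasets show that the method performs well in all three respects: model selection, feature selection and clustering results.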


Keywords: finite mixtures, unsupervised learning, text clustering, feature selection, model selection, expectation-maximization algorithm

Abstract: A generalized method is presented for unsupervised text clustering. The relevance of the features to the mixture components is introduced into the mixture model as a set of latent variables. Model selection, feature selection and parameter estimation of the mixture model are then integrated into one general framework. Experimental results on four large-scale document datasets show that the proposed method achieves good results in model selection, feature selection and clustering performance.
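For orientation only, the estimation step that such a framework builds on can be sketched as plain EM on a finite mixture over documents. The sketch below is not the proposed method: it omits the feature-relevance latent variables and the model selection step described in the abstract. The multinomial component model, the function name `em_multinomial_mixture`, the fixed number of components `k`, the smoothing constant `alpha`, and the deterministic initialisation are all assumptions made for illustration.

```python
import math

def em_multinomial_mixture(docs, k, n_iter=30, alpha=1.0):
    """Plain EM for a k-component multinomial mixture (illustrative sketch).

    docs: list of equal-length lists of non-negative term counts.
    Returns (weights, theta, resp): mixing weights, per-component word
    distributions, and per-document posterior responsibilities.
    """
    n, v = len(docs), len(docs[0])
    # Assumed initialisation: seed component j from an evenly spaced document.
    # Random restarts are a common alternative in practice.
    weights = [1.0 / k] * k
    theta = []
    for j in range(k):
        counts = [alpha + c for c in docs[(j * n) // k]]
        total = sum(counts)
        theta.append([c / total for c in counts])
    resp = [[1.0 / k] * k for _ in range(n)]
    for _ in range(n_iter):
        # E-step: posterior over components for each document, in log space.
        for i, doc in enumerate(docs):
            logp = [math.log(max(weights[j], 1e-300))
                    + sum(c * math.log(theta[j][w]) for w, c in enumerate(doc))
                    for j in range(k)]
            top = max(logp)
            p = [math.exp(x - top) for x in logp]
            z = sum(p)
            resp[i] = [x / z for x in p]
        # M-step: re-estimate mixing weights and smoothed word distributions.
        weights = [sum(r[j] for r in resp) / n for j in range(k)]
        for j in range(k):
            counts = [alpha + sum(resp[i][j] * docs[i][w] for i in range(n))
                      for w in range(v)]
            total = sum(counts)
            theta[j] = [c / total for c in counts]
    return weights, theta, resp

# Toy term-count matrix: two groups of documents with disjoint vocabularies.
toy_docs = [[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
            [0, 0, 5, 4], [0, 0, 3, 6], [0, 0, 4, 5]]
weights, theta, resp = em_multinomial_mixture(toy_docs, k=2)
# Hard cluster label per document: the component with the largest posterior.
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print(labels)
```

Here `k` is fixed by hand and every term contributes to every component, which are exactly the two choices the abstract says the proposed framework makes automatically.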

Key words: Finite Mixtures, Unsupervised Learning, Document Clustering, Feature Selection, Model Selection, Expectation-Maximization Algorithm

Received: 2006-07-24

CLC number: TP181

Funding: Supported by the National Natural Science Foundation of China (No.70571057) and the Program for New Century Excellent Talents (No.NECT-05-R013)

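About the authors: ZHANG Liang, male, born in 1979, Ph.D. candidate. His research interests include information retrieval and information filtering, artificial intelligence and machine learning. Email: zhangliang.tju@gmail.com. LI Min-Qiang, male, born in 1965, professor. His research interests include information systems and systems engineering, evolutionary computation and artificial intelligence.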

Cite this article:

ZHANG Liang, LI Min-Qiang. A Generalized Method for Unsupervised Text Clustering Using Finite Mixture Models[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2007, 20(5): 698-703.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2007/V20/I5/698
