A Generalized Method for Unsupervised Text Clustering Using Finite Mixture Models
ZHANG Liang, LI Min-Qiang
School of Management, Tianjin University, Tianjin 300072
Abstract A generalized method is presented for unsupervised text clustering. The relevance of each feature to the mixture components is introduced into the mixture model as a set of latent variables. This allows model selection, feature selection, and parameter estimation for the mixture model to be integrated into one general framework. Experimental results on four large-scale document datasets show that the proposed method performs well in model selection, feature selection, and clustering.
Received: 24 July 2006