A Topic Model Based on CRP and Word Similarity
ZHANG Xiao-Ping¹, ZHOU Xue-Zhong¹, HUANG Hou-Kuan¹, FENG Qi¹, CHEN Shi-Bo²
¹ School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
² Guanganmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053
Abstract Topic models extract the topics hidden in documents, which reduces document dimensionality and allows documents to be classified and retrieved by topic. They are a research focus in the fields of information classification and retrieval. To address the problem that the LDA topic model cannot determine the number of topics automatically, a latent topic model is proposed that combines the similarity between words with the Chinese restaurant process (CRP). The model adaptively updates the content of the topics and determines a reasonable number of topics. In addition, a novel method for setting the hyperparameters during topic updating is put forward. Experimental results on a traditional Chinese medicine (TCM) clinical dataset show that the proposed model produces good analysis results that are accepted by TCM experts.
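
The abstract does not give the model's equations, so the following is only a point of reference: a minimal Python sketch of the standard Chinese restaurant process, which shows how the number of clusters (topics) can grow with the data instead of being fixed in advance. The function name `crp_assignments` and the parameters `alpha` (concentration, assumed positive) and `seed` are illustrative assumptions. The word-similarity component and the hyperparameter-setting method proposed in the paper are not reproduced here.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Sample table assignments from a standard Chinese restaurant process.

    Illustrative only: this is the textbook CRP, not the paper's model.
    alpha is the concentration parameter and must be positive.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of customers seated at table k
    assignments = []  # assignments[i] = table index chosen by customer i
    for _ in range(n_customers):
        # An existing table k is chosen with weight counts[k];
        # a new table is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(n_customers=100, alpha=1.0)
    print("number of tables:", len(counts))
    print("table sizes:", counts)
```

In this sketch a larger `alpha` tends to open more tables, which is why the choice of hyperparameters affects how many topics a CRP-based model ends up with.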
Received: 27 April 2009