A Topic Model Based on CRP and Word Similarity


Authors: ZHANG Xiao-Ping¹, ZHOU Xue-Zhong¹, HUANG Hou-Kuan¹, FENG Qi¹, CHEN Shi-Bo²

Abstract: Topic models extract the topics hidden in documents, so that documents can be reduced in dimensionality, classified, and retrieved by topic; they have become a research focus in information classification and retrieval. To address the inability of the LDA (Latent Dirichlet Allocation) topic model to determine the number of topics automatically, a latent topic model that combines word similarity with the Chinese restaurant process (CRP) is proposed. The model adaptively updates topic contents and determines a reasonable number of topics. A method for setting the hyperparameters while the number of topics is updated dynamically is also proposed. In experiments on traditional Chinese medicine (TCM) clinical diagnosis and treatment data, the model yields analysis results that domain experts find readily interpretable.

Keywords: topic model, word similarity, Dirichlet distribution
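
The abstract names the Chinese restaurant process (CRP) as the mechanism that lets the number of topics grow with the data, but this page does not reproduce the model's equations. As background only, the following is a minimal sketch of the standard CRP, not the authors' word-similarity variant: each new customer (for example, a word) joins an existing table (topic) with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to a concentration parameter. The function name `sample_crp` and the parameter name `alpha` are assumptions made for this illustration.

```python
import random


def sample_crp(num_customers, alpha, seed=None):
    """Draw table assignments from a standard Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter. Returns (assignments, table_sizes).
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[n] = table index chosen by customer n
    for _ in range(num_customers):
        # An existing table k has weight table_sizes[k];
        # the option of opening a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, table_sizes = sample_crp(200, alpha=1.0, seed=0)
    print("occupied tables:", len(table_sizes))
    print("table sizes:", table_sizes)
```

In this sketch the number of occupied tables is not fixed in advance; under the standard CRP it grows roughly logarithmically with the number of customers, which is the property that makes the process attractive when the number of topics is unknown.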

Received: 2009-04-27

CLC number: TP391

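Funding: Supported by the National Basic Research Program of China (973 Program) (No. 2006CB504601), the National Key Technology R&D Program of China (No. 2007BA110B06-01), the National Natural Science Foundation of China (No. 90709006), and the Scientific Research Key Project of the Beijing Municipal Science and Technology Commission (No. D08050703020804).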

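About the authors: ZHANG Xiao-Ping, female, born in 1969, Ph.D. candidate, associate professor. Her research interests include artificial intelligence and data mining. E-mail: zh_xping@hotmail.com. ZHOU Xue-Zhong, male, born in 1977, Ph.D., master's supervisor. His research interests include data warehousing, data mining, medical ontology, and TCM informatics. HUANG Hou-Kuan, male, born in 1940, professor, doctoral supervisor. His research interests include artificial intelligence, data mining, and machine learning. FENG Qi, male, born in 1982, Ph.D. candidate. His research interests include data mining and POMDP. CHEN Shi-Bo, male, born in 1973, Ph.D., attending physician. His research interests include TCM prevention and treatment of diabetes and its complications, individualized diagnosis and treatment, and clinical evaluation.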

Cite this article:

ZHANG Xiao-Ping, ZHOU Xue-Zhong, HUANG Hou-Kuan, FENG Qi, CHEN Shi-Bo. A Topic Model Based on CRP and Word Similarity[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2010, 23(1): 72-76.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2010/V23/I1/72
