Ranking Topic Models without Query<sup>*</sup>


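Abstract: Topic models have become an important tool in fields such as machine learning and natural language processing, as they can discover the latent topics in large-scale corpora. As corpora grow, the number of discovered topics grows with them. The vast majority of topic models are based on the bag-of-words model and cannot describe the order relations among terms, so topics cannot be distinguished by importance. This paper proposes a query-independent ranking topic model framework that ranks topics using the various relationships among them, yielding an ordered topic list. Topic relationships evaluate the influence of a topic at the topic level; term contribution is then introduced to evaluate topics at the level of term semantics, down-weighting topics that are popular but semantically vacuous. Because there is not yet an accepted evaluation standard for ranking topic models, the ordered topics are used as features for multi-document automatic summarization, and the quality of the topic ranking is evaluated indirectly through summarization quality. Experimental results show that the ranked topic model outperforms unranked topic models.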


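Keywords: ranking topic models, evaluation of topic models, multi-document automatic summarization, extractive summarization, summary sentence ranking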

Abstract: Topic models have become important tools in machine learning and natural language processing because they can discover hidden topics in large-scale corpora. However, as the corpus grows, so does the number of discovered topics. Most topic models are built on the bag-of-words model, which cannot describe the order of terms, so topics cannot be distinguished from one another by importance. This paper proposes a framework for ranking topic models without a query, in which topics are ranked according to the relationships among them to produce an ordered topic list. Topic relationships are used to evaluate topic influence at the topic level, and term significance is used to evaluate topics at the level of term semantics, so that popular topics with little semantic content are down-weighted. Since there is no widely accepted evaluation criterion for ranking topic models, the ranked topics are used as features for multi-document automatic summarization, and the quality of the topic ranking is measured indirectly through summarization performance. The experimental results show that ranking topic models outperform topic models without ranking.
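The abstract names two signals, topic influence derived from relationships among topics and a term-level score that penalizes semantically empty topics, but it does not give their formulas. The following is only a minimal sketch under stated assumptions, not the authors' method: topic influence is approximated here by a PageRank-style iteration over cosine similarities between topic-word distributions, and the term-level score by the KL divergence of each topic from the average topic. The function names, the damping factor, and the multiplicative combination are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def topic_influence(phi, damping=0.85, iters=50):
    """Assumed topic-level score: PageRank on a topic-similarity graph.

    phi is a list of topic-word distributions (one row per topic).
    """
    k = len(phi)
    sim = [[cosine(phi[i], phi[j]) if i != j else 0.0 for j in range(k)]
           for i in range(k)]
    out = [sum(row) for row in sim]  # total edge weight leaving each topic
    score = [1.0 / k] * k
    for _ in range(iters):
        score = [
            (1 - damping) / k
            + damping * sum(score[i] * sim[i][j] / out[i]
                            for i in range(k) if out[i] > 0)
            for j in range(k)
        ]
    return score


def term_contribution(phi):
    """Assumed term-level score: KL divergence from the average topic.

    A topic whose word distribution is close to the average carries
    little distinctive meaning and receives a low score.
    """
    k, v = len(phi), len(phi[0])
    mean = [sum(phi[t][w] for t in range(k)) / k for w in range(v)]
    return [sum(p * math.log(p / m) for p, m in zip(row, mean) if p > 0)
            for row in phi]


def rank_topics(phi):
    """Return topic indices ordered from highest to lowest combined score."""
    influence = topic_influence(phi)
    contribution = term_contribution(phi)
    combined = [a * b for a, b in zip(influence, contribution)]
    return sorted(range(len(phi)), key=lambda t: combined[t], reverse=True)


if __name__ == "__main__":
    # Toy input: topics 0 and 1 are peaked, topic 2 is flat (vacuous).
    toy_phi = [
        [0.70, 0.10, 0.10, 0.10],
        [0.10, 0.70, 0.10, 0.10],
        [0.25, 0.25, 0.25, 0.25],
    ]
    print(rank_topics(toy_phi))  # the flat topic 2 comes last
```

In the toy input the flat topic is the most similar to every other topic, so it scores highest on the graph-based influence alone; multiplying by the term-level score moves it to the end of the list, which is the kind of correction the abstract describes for popular but semantically empty topics.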

Key words: Ranking Topic Models, Evaluation of Topic Models, Multi-document Summarization, Extractive Summarization, Sentence Ranking

Received: 2013-05-26

CLC number: TP391.2

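Funding: Supported by the National Natural Science Foundation of China (No. 61370070, 61272369, 61301185, 61300082), the Dalian Science and Technology Plan Project (No. 2011A17GX073, 2013J21DW006), and the Fundamental Research Funds for the Central Universities (No. 3132013335)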

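About the authors: XIAO Zhi-Bo, male, born in 1984, Ph.D. candidate. His research interests include topic models, multi-document automatic summarization, information retrieval, and machine learning. E-mail: xiaozhibo@dlmu.edu.cn. CHE Feng, male, born in 1989, master's student. His research interest is multi-document automatic summarization. WU Di, male, born in 1978, Ph.D. candidate. His research interests include image retrieval and machine learning. LI Qing-Feng, male, born in 1987, master's student. His research interests include topic models and multi-document automatic summarization. LU Ming-Yu (corresponding author), male, born in 1963, professor and Ph.D. supervisor. His research interests include machine learning, data mining, and data warehousing. E-mail: lumingyu@dlmu.edu.cn.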

Cite this article:

XIAO Zhi-Bo, CHE Feng, WU Di, LI Qing-Feng, LU Ming-Yu. Ranking Topic Models without Query[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2014, 27(7): 623-630.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2014/V27/I7/623
