Abstract: Topic models have become important tools in machine learning and natural language processing because they can discover hidden topics in large-scale corpora. However, as the size of the corpus grows, so does the number of discovered topics. Most topic models are based on the bag-of-words model, so the order among terms cannot be described, which makes topics hard to distinguish from one another. This paper proposes a framework for ranking topic models without queries, in which topics are ranked according to their relationships to produce an ordered topic list. At the topic level, topic relationships are used to evaluate topic influence; at the term level, term significance is used to evaluate term importance, and popular topics that rank highly but carry little semantic content are weakened. Since there is no widely accepted evaluation criterion for ranking topic models, the ranked topics are used as features for multi-document automatic summarization, and the performance of the ranking topic models is measured indirectly through summarization performance. The experimental results show that ranking topic models outperform topic models without ranking.