Discovering News Topics from Microblogs Based on Hidden Topics Analysis and Text Clustering
LU Rong, XIANG Liang, LIU Ming-Rong, YANG Qing
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Abstract A method is proposed for extracting news topics from large-scale collections of short posts on microblogging services. Hidden topic analysis is used to address the difficulty of measuring similarity between short texts. In each time window, the posts most likely to discuss news events are selected according to the characteristics of news. A two-level hybrid clustering method, combining K-means and hierarchical clustering, then groups the selected posts into news topics. Experimental results show that the proposed method works well on a large-scale microblog dataset.
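The abstract does not give the details of the two-level clustering step, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each selected post is already represented by a hidden-topic distribution (for example, from a topic model such as LDA), that K-means first splits the posts into coarse groups, and that average-linkage hierarchical clustering then refines each group. The cosine distance, the function names, the toy vectors, and the `k` and `threshold` values are all illustrative assumptions.

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20, seed=0):
    """Level 1 (assumed): coarse K-means partition, returns one label per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid from its members (empty clusters keep theirs).
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

def agglomerate(indices, vectors, threshold):
    """Level 2 (assumed): average-linkage merging until no pair is within threshold."""
    clusters = [[i] for i in indices]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def hybrid_cluster(vectors, k, threshold):
    """Run K-means, then refine each coarse group hierarchically."""
    labels = kmeans(vectors, k)
    result = []
    for c in range(k):
        indices = [i for i, lab in enumerate(labels) if lab == c]
        if indices:
            result.extend(agglomerate(indices, vectors, threshold))
    return result

# Toy topic distributions for five hypothetical posts (three hidden topics).
posts = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
clusters = hybrid_cluster(posts, k=2, threshold=0.2)
print(sorted(sorted(c) for c in clusters))
```

In this sketch the coarse K-means pass keeps the quadratic-cost hierarchical step confined to small groups, which is one common motivation for a hybrid design on large datasets. The distance threshold, rather than a fixed cluster count, determines how many topics come out of each group.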
Received: 13 October 2010