Word Embedding Based Chinese News Event Detection and Representation
ZHANG Bin1, HU Linmei1, HOU Lei1, LI Juanzi1
1. Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Abstract Existing event detection methods rely mainly on the traditional TF-IDF document representation, which is high-dimensional and semantically sparse, leading to low efficiency and accuracy. They are therefore unsuitable for large-scale online news event detection. This paper proposes a document representation method based on word embedding. The method reduces the dimensionality of the document representation, alleviates the semantic sparsity problem, and improves both the efficiency and the accuracy of document similarity calculation. Building on this representation, a dynamic online clustering method is proposed for online news event detection, improving both the accuracy and the recall of event detection. Experiments on the standard TDT4 dataset and a real-world dataset show that the proposed adaptive online event detection method significantly outperforms state-of-the-art methods in both efficiency and accuracy.
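The general idea in the abstract, dense document vectors built from word embeddings and fed to an online clustering step, can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not say how word vectors are pooled or how the clustering adapts, so the averaging in `doc_vector`, the fixed cosine `threshold`, the running-mean centroid update, and the toy 3-dimensional vectors in `TOY_EMBEDDINGS` are all assumptions made for illustration.

```python
import math

# Hypothetical toy word vectors. Real embeddings (e.g. word2vec) are learned
# from a corpus and have far more dimensions; these values are made up.
TOY_EMBEDDINGS = {
    "earthquake": [0.90, 0.10, 0.00],
    "quake":      [0.85, 0.15, 0.05],
    "rescue":     [0.80, 0.20, 0.10],
    "election":   [0.00, 0.90, 0.20],
    "vote":       [0.10, 0.80, 0.30],
}


def doc_vector(tokens, embeddings):
    """Average the vectors of known tokens (an assumed pooling scheme)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, embeddings, threshold=0.8):
    """Assign each incoming document to its most similar cluster,
    or open a new cluster (a new event) if none is similar enough.
    The fixed threshold is a simplification, not the paper's scheme."""
    clusters = []  # each cluster: {"centroid": [...], "members": [doc indices]}
    for idx, tokens in enumerate(docs):
        vec = doc_vector(tokens, embeddings)
        if vec is None:
            continue  # no known words, so the document cannot be placed
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            # Running mean keeps the centroid equal to the average of its members.
            best["centroid"] = [(c * n + v) / (n + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": vec, "members": [idx]})
    return clusters


docs = [
    ["earthquake", "rescue"],
    ["election", "vote"],
    ["quake", "rescue"],
    ["vote"],
]
events = single_pass_cluster(docs, TOY_EMBEDDINGS)
print([c["members"] for c in events])
```

On this toy stream the two earthquake stories fall into one cluster and the two election stories into another, even though "earthquake" and "quake" share no surface form, which is the kind of semantic match a sparse TF-IDF vector would miss.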
Received: 26 September 2017
Fund: Supported by the National Basic Research Program of China (973 Program) (No. 2014CB340504), the Key Program of the National Natural Science Foundation of China (No. 61533018, 61661146007), the Fund of the Online Education Research Center of the Ministry of Education of China (No. 2016ZD102), and the Tsinghua-NUS NEXT Joint Research Center Program
Corresponding Author:
HOU Lei, Ph.D. His research interests include news and user-generated content analysis and the semantic web.
About the authors: ZHANG Bin, master's student. His research interests include news mining. HU Linmei, Ph.D. candidate. Her research interests include text mining and natural language processing. LI Juanzi, Ph.D., professor. Her research interests include data mining, semantic web and knowledge graph.