State Key Laboratory of Intelligent Technology and Systems, Beijing 100084; Tsinghua National Laboratory for Information Science and Technology, Beijing 100084; Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Abstract: New word discovery is of great significance in natural language processing, and it is more difficult in microblog text than in other corpora. This paper proposes an algorithm based on context entropy, in which new word candidates are filtered according to their context. To improve precision, lexical features are introduced and combined with term frequency in a second algorithm. With these methods, both precision and recall are greatly improved, and the F-measure reaches 89.6%.
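The abstract does not give the formulas behind the context-entropy filter, so the following is only a minimal sketch of one common formulation (left and right branching entropy over character n-grams), not the authors' implementation. The function names, the boundary symbols `<s>` and `</s>`, and the thresholds `max_len`, `min_freq`, and `min_entropy` are all assumptions made for illustration. The lexical-feature step mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter, defaultdict


def context_entropy(neighbors):
    """Shannon entropy (in nats) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in neighbors.values())


def candidate_words(texts, max_len=4, min_freq=2, min_entropy=0.5):
    """Return {candidate: (freq, left_entropy, right_entropy)}.

    Candidates are character n-grams of length 2..max_len. A candidate is
    kept only if it occurs at least min_freq times and both its left and
    right contexts are varied enough (entropy >= min_entropy).
    All thresholds are hypothetical defaults, not values from the paper.
    """
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                freq[gram] += 1
                # Text boundaries are mapped to a single shared symbol.
                left[gram][text[i - 1] if i > 0 else "<s>"] += 1
                right[gram][text[i + n] if i + n < len(text) else "</s>"] += 1

    result = {}
    for gram, count in freq.items():
        if count < min_freq:
            continue
        h_left = context_entropy(left[gram])
        h_right = context_entropy(right[gram])
        # A string that always has the same neighbor is likely a fragment
        # of a longer word, so require variety on both sides.
        if min(h_left, h_right) >= min_entropy:
            result[gram] = (count, h_left, h_right)
    return result


sample = ["abxyzcd", "efxyzgh", "ijxyzkl"]
found = candidate_words(sample)
```

In this toy input, `xyz` appears three times with a different character on each side, so it passes the filter, while `xy` is rejected because it is always followed by `z`. A frequency threshold alone would not separate the two, which is the motivation for filtering candidates by context.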
HUO Shuai, ZHANG Min, LIU Yi-Qun, MA Shao-Ping. New Words Discovery in Microblog Content [J]. Pattern Recognition and Artificial Intelligence, 2014, 27(2): 141-145.