Keyword Extraction Based on Lexical Chains for Chinese News Web Pages


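Abstract: A lexical chain is the outward manifestation of the coherence created by semantic relations between words, and it provides important clues about the structure and topic of a text. Building on a solution to word sense ambiguity, this paper proposes a method that uses lexical chains, combined with frequency, position, and cohesion features, to extract keywords from Chinese news web pages. The method represents a document as a set of lexical chains according to the semantic links between its words and extracts keywords from that representation. Experiments on two corpora, Chinese news web pages and academic journal articles, show that the method clearly improves the quality of the extracted keywords.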


Keywords: lexical chain, keyword extraction, ambiguity resolution, semantic similarity

Abstract: A lexical chain is the surface expression of the cohesion created by semantically related words in a text, and it represents the semantic content of the text. Building on word ambiguity resolution, a method for keyword extraction from Chinese news web pages is proposed that uses lexical chains combined with frequency, location, and cohesion features. The document is represented as lexical chains according to the relationships between phrases, and the key phrases are extracted from these chains. The proposed method is tested on corpora of Chinese news web pages and journal articles. The experimental results show that it improves the quality of the extracted keywords.

Key words: Lexical Chain, Keyword Extraction, Ambiguity Resolution, Semantic Similarity
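The abstract names the ingredients of the method (lexical chains plus frequency, position, and cohesion features) but not its formulas. The following is a minimal sketch of how such ingredients could be combined, not the authors' algorithm. The greedy chaining rule, the threshold, the weights, the use of a chain's share of tokens as the cohesion score, and the toy `toy_similarity` function are all assumptions made for illustration; a real system would substitute a thesaurus-based similarity measure and a word segmenter for Chinese text.

```python
from collections import Counter


def build_chains(words, similarity, threshold=0.8):
    """Greedily group distinct words into chains (assumed rule, not the paper's).

    A word joins the first chain that already holds a word whose similarity
    to it reaches the threshold; otherwise it starts a new chain.
    """
    chains = []
    for word in dict.fromkeys(words):  # distinct words, first-occurrence order
        for chain in chains:
            if any(similarity(word, member) >= threshold for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains


def rank_keywords(words, similarity, threshold=0.8, weights=(0.4, 0.3, 0.3)):
    """Rank words by a weighted sum of three illustrative features.

    frequency: count divided by the largest count in the document
    position:  1 - first_index / len(words), so early words score higher
    cohesion:  share of all tokens that belong to the word's chain
    The weights are hypothetical and would need tuning on real data.
    """
    if not words:
        return []
    n = len(words)
    freq = Counter(words)
    max_freq = max(freq.values())
    first_pos = {}
    for index, word in enumerate(words):
        first_pos.setdefault(word, index)

    cohesion = {}
    for chain in build_chains(words, similarity, threshold):
        share = sum(freq[w] for w in chain) / n
        for w in chain:
            cohesion[w] = share

    w_freq, w_pos, w_coh = weights
    scores = {
        w: w_freq * freq[w] / max_freq
        + w_pos * (1 - first_pos[w] / n)
        + w_coh * cohesion[w]
        for w in freq
    }
    # Higher score first; ties broken by earlier first occurrence.
    return sorted(scores, key=lambda w: (-scores[w], first_pos[w]))


# Hypothetical similarity: 1.0 for identical or listed related pairs, else 0.0.
RELATED = {frozenset(("market", "stock")), frozenset(("stock", "shares"))}


def toy_similarity(a, b):
    return 1.0 if a == b or frozenset((a, b)) in RELATED else 0.0


tokens = ["market", "stock", "market", "weather", "shares", "stock", "market"]
print(build_chains(tokens, toy_similarity))
print(rank_keywords(tokens, toy_similarity)[:2])
```

In this sketch a word that is rare on its own can still rank well if it belongs to a chain that covers much of the document, which is the intuition behind using chains rather than raw frequency alone.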

Received: 2008-06-06

CLC number: TP181

Funding: Supported by the National Natural Science Foundation of China (No. 60573174)

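About the authors: HU Xue-Gang, male, born in 1961, professor, Ph.D. His research interests include data mining, machine learning, and knowledge engineering. E-mail: jsjxhuxg@hfut.edu.cn. LI Xing-Hua, male, born in 1984, master's degree. His research interest is data mining. XIE Fei, male, born in 1980, lecturer, Ph.D. candidate. His research interest is text mining. WU Xin-Dong, male, born in 1963, professor, Ph.D. supervisor. His research interests include artificial intelligence and data mining.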

Cite this article:

HU Xue-Gang, LI Xing-Hua, XIE Fei, WU Xin-Dong. Keyword Extraction Based on Lexical Chains for Chinese News Web Pages [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2010, 23(1): 45-51.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2010/V23/I1/45
