Abstract: When n-grams are used as text classification features, classification accuracy decreases because the redundancy and relevance among words are ignored during n-gram weighting. Therefore, an n-gram feature weighting algorithm based on relevance and semantics is proposed. To reduce internal redundancy, feature reduction is applied to the n-grams during text preprocessing. The n-grams are then weighted according to the relevance between the words and classes within each n-gram and the semantic similarity between the n-grams and the test dataset. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that the proposed algorithm selects n-gram features with high relevance and low redundancy and alleviates data sparseness when quantifying the test dataset.
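The abstract does not give the paper's exact weighting formula, but the core idea (score each n-gram by how relevant its constituent words are to a class) can be sketched as follows. This is a minimal illustration, not the authors' method: pointwise mutual information is used here as a stand-in for the word-class relevance term, and all function names and the toy corpus are hypothetical.

```python
from math import log


def extract_ngrams(tokens, n=2):
    """Return all contiguous n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_class_relevance(word, cls, docs):
    """Pointwise mutual information between a word and a class label,
    used here as an illustrative relevance measure.
    docs: list of (token_list, class_label) pairs."""
    n = len(docs)
    n_w = sum(1 for tokens, _ in docs if word in tokens)
    n_c = sum(1 for _, c in docs if c == cls)
    n_wc = sum(1 for tokens, c in docs if word in tokens and c == cls)
    if n_w == 0 or n_c == 0 or n_wc == 0:
        return 0.0
    return log(n * n_wc / (n_w * n_c))


def ngram_weight(ngram, cls, docs):
    """Weight an n-gram by the mean relevance of its words to a class."""
    return sum(word_class_relevance(w, cls, docs) for w in ngram) / len(ngram)


# Toy corpus: two labeled documents.
docs = [(["stock", "market", "rises"], "finance"),
        (["team", "wins", "match"], "sports")]

bigrams = extract_ngrams(["stock", "market", "rises"])  # [("stock","market"), ("market","rises")]
w = ngram_weight(("stock", "market"), "finance", docs)   # positive: both words co-occur with "finance"
```

An n-gram whose words never appear in a class scores 0 under this sketch, so class-irrelevant n-grams are naturally down-weighted. The paper additionally incorporates semantic similarity (via HowNet, per reference [18]) between n-grams and the test dataset, which this sketch omits.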
QIU Yun-Fei, LIU Shi-Xing, LIN Ming-Ming, SHAO Liang-Shan. n-grams Features Weighting Algorithm Based on Relevance and Semantics [J]. Pattern Recognition and Artificial Intelligence, 2015, 28(11): 992-1001.