Abstract: To address the noise in training data and the insufficient use of negative instances in traditional distant-supervision relation extraction methods, a relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning is proposed. Firstly, a set of relation instances is generated by distant supervision. Secondly, based on clause identification, a denoising algorithm is applied to reduce the wrongly labeled data in the relation instance set. Thirdly, lexical features are extracted from the relation instances and transformed into distributed vectors to build a feature dataset. Finally, all positive instances and part of the negative instances in the feature dataset are selected to form the labeled dataset, the remaining negative instances form the unlabeled dataset, and a relation classifier is trained with an improved semi-supervised ensemble learning algorithm. Experiments show that, compared with baseline methods, the proposed method achieves higher accuracy and recall.
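The abstract outlines a pipeline whose last step splits the feature dataset into labeled data (all positives plus part of the negatives) and unlabeled data (the remaining negatives) before training a semi-supervised ensemble. The sketch below illustrates that split and a generic high-confidence self-labeling ensemble loop; the function names, the negative-sampling ratio, the bootstrap ensemble, and the confidence threshold are illustrative assumptions and do not reproduce the paper's improved semi-supervised ensemble algorithm.

```python
# Minimal sketch, assuming feature vectors X (numpy array) and binary labels y,
# of the labeled/unlabeled split and a self-training-style ensemble round.
import numpy as np
from sklearn.linear_model import LogisticRegression


def split_labeled_unlabeled(X, y, neg_labeled_ratio=0.3, seed=0):
    """Keep all positive instances and a fraction of the negatives as labeled
    data; the remaining negatives become the unlabeled pool (ratio is an
    assumed parameter, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = rng.permutation(np.where(y == 0)[0])
    n_lab_neg = int(len(neg_idx) * neg_labeled_ratio)
    lab_idx = np.concatenate([pos_idx, neg_idx[:n_lab_neg]])
    return X[lab_idx], y[lab_idx], X[neg_idx[n_lab_neg:]]


def semi_supervised_ensemble(X_lab, y_lab, X_unl, n_rounds=5, n_models=3,
                             conf_threshold=0.9, seed=0):
    """Train an ensemble on bootstrap samples of the labeled data, then move
    unlabeled instances the ensemble predicts with high confidence into the
    labeled set, repeating for a few rounds."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rounds):
        models = []
        for _ in range(n_models):
            boot = rng.integers(0, len(X_lab), len(X_lab))
            models.append(
                LogisticRegression(max_iter=1000).fit(X_lab[boot], y_lab[boot]))
        if len(X_unl) == 0:
            break
        # Average predicted probabilities of the positive class over the ensemble.
        proba = np.mean([m.predict_proba(X_unl)[:, 1] for m in models], axis=0)
        confident = (proba > conf_threshold) | (proba < 1 - conf_threshold)
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, (proba[confident] > 0.5).astype(int)])
        X_unl = X_unl[~confident]
    return models
```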