Pattern Recognition and Artificial Intelligence  2019, Vol. 32 Issue (2): 133-143    DOI: 10.16451/j.cnki.issn1003-6059.201902005
Entity Relations Extraction in Chinese Domain Based on Distant Supervision with Multi-feature Fusion
WANG Bin1, GUO Jianyi1, 2, XIAN Yantuan1, 2, WANG Hongbin1, 2, YU Zhengtao1, 2
1.Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500;
2.Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500

Abstract: To extract Chinese domain entity relations from unlabeled text, a hybrid method for domain entity attribute relation extraction based on distant supervision is proposed. Structured relation triples already present in a knowledge base are used to obtain a training corpus automatically from natural language text. Because data labeled by distant supervision contain a large amount of noise, a latent Dirichlet allocation (LDA) topic model is used to extract topic keywords, and the data are then denoised by computing the similarity between these keywords and the relation types and by keyword pattern matching. Finally, part-of-speech features, dependency features, and phrase syntax tree features are extracted and fused to train the relation extraction model. Experiments show that fusing the three features yields a higher F value and better extraction performance.
Key words: Distant Supervision; Entity Relation Extraction; Domain Knowledge Base; Feature Fusion; Latent Dirichlet Allocation Topic Model
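The distant supervision step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible alignment rule: a sentence is labeled with a relation whenever it mentions both entities of a knowledge base triple. The function name `distant_label`, the relation name `born_in`, and the toy sentences are hypothetical and are not taken from the paper, whose actual alignment and denoising procedure is more involved.

```python
from typing import List, Tuple

# (head entity, relation, tail entity); a hypothetical knowledge base format
Triple = Tuple[str, str, str]


def distant_label(
    sentences: List[str], triples: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both entities of a triple.

    Returns (sentence, head, tail, relation) tuples. This is plain
    substring matching, an assumption made for illustration only.
    """
    labeled = []
    for sentence in sentences:
        for head, relation, tail in triples:
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


# Toy data (hypothetical, not from the paper)
triples = [("Yao Ming", "born_in", "Shanghai")]
sentences = [
    "Yao Ming was born in Shanghai.",
    "Yao Ming visited a school in Shanghai last week.",
    "Shanghai is a large city.",
]

labeled = distant_label(sentences, triples)
for row in labeled:
    print(row)
```

Both of the first two sentences receive the `born_in` label, although only the first expresses that relation. The second is the kind of noisy example that the denoising step in the abstract is meant to filter out.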
Received: 2018-10-15
Chinese Library Classification (CLC) number: TP 391.1
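About the authors:
WANG Bin, master's student. Research interest: natural language processing. E-mail: 1105193825@qq.com.
GUO Jianyi (corresponding author), master's degree, professor. Research interests: pattern recognition, natural language processing, information extraction, knowledge acquisition. E-mail: giade86@hotmail.com.
XIAN Yantuan, Ph.D. candidate, lecturer. Research interests: machine translation, information retrieval, information extraction. E-mail: yantuan.xian@gmail.com.
WANG Hongbin, Ph.D. candidate. Research interests: intelligent information systems, natural language processing, information retrieval. E-mail: whbin2007@126.com.
YU Zhengtao, Ph.D., professor. Research interests: machine translation, natural language processing, information retrieval. E-mail: ztyu@hotmail.com.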
