Entity Relations Extraction in Chinese Domain Based on Distant Supervision with Multi-feature Fusion
WANG Bin 1, GUO Jianyi 1,2, XIAN Yantuan 1,2, WANG Hongbin 1,2, YU Zhengtao 1,2
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; 2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500
Abstract: To extract Chinese domain entity relations from unlabeled text, a hybrid method for domain entity attribute extraction based on distant supervision is proposed. The structured relation triples in a knowledge base are used to obtain a training corpus automatically from natural language text. Because the data annotated by distant supervision contain a large amount of noise, a latent Dirichlet allocation (LDA) topic model is adopted to extract topic keywords; the similarity between these keywords and the relation type is then computed, and keyword pattern matching is applied for denoising. Finally, part-of-speech features, dependency features and phrase syntax tree features are extracted, and the relation extraction model is trained. Experiments show that the method fusing the three features achieves a higher F value and better extraction performance.
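The labeling and denoising steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `distant_label` and `denoise`, the English toy sentences, and the hand-written keyword table are all assumptions made for illustration. In particular, the keyword table stands in for the output of the LDA keyword extraction and similarity calculation, and plain substring matching stands in for entity recognition.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]        # (head entity, relation, tail entity)
Example = Tuple[str, str, str, str]  # (sentence, head, tail, relation)


def distant_label(sentences: List[str], triples: List[Triple]) -> List[Example]:
    """Distant supervision heuristic: a sentence that mentions both
    entities of a knowledge-base triple is labeled with that relation."""
    labeled = []
    for sent in sentences:
        for head, rel, tail in triples:
            # Substring matching is a simplification (assumption).
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, rel))
    return labeled


def denoise(examples: List[Example], keywords: Dict[str, Set[str]]) -> List[Example]:
    """Keep an example only if the sentence contains at least one
    keyword associated with its relation type."""
    return [
        ex for ex in examples
        if any(kw in ex[0] for kw in keywords.get(ex[3], set()))
    ]


# Hypothetical toy data, not taken from the paper.
triples = [("Kunming", "located_in", "Yunnan")]
sentences = [
    "Kunming is located in Yunnan province.",
    "A train runs from Kunming to a town outside Yunnan.",
    "Dali is a city.",
]
# Hand-written stand-in for LDA-derived topic keywords (assumption).
keywords = {"located_in": {"located"}}

labeled = distant_label(sentences, triples)
clean = denoise(labeled, keywords)
print(len(labeled), len(clean))
```

The second sentence mentions both entities but does not express the relation, which is the kind of noisy label the keyword filter is meant to remove.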