Texts Similarity Algorithm Based on Subtrees Matching

Keywords: Metadata, Subtree Matching, Semantics, Text Similarity

Abstract: To reduce the dimensionality of text vectors and improve the performance of semantic similarity measurement, a text similarity algorithm is proposed that combines the advantages of statistical methods and a semantic dictionary. Metadata feature vectors are generated from the texts, which reduces the dimensionality of the vector space. A text similarity algorithm based on subtree matching is then designed, in which subtrees are used to speed up the similarity computation. Semantically matching the metadata feature vectors against the subtrees improves the accuracy of the semantic similarity measurement. The algorithm also takes the meaning of synonyms in the metadata into account, which strengthens the semantic coverage of the similarity measurement between texts. The experimental results show that the proposed method is feasible and effective.
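
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: fold synonyms in two weighted feature vectors onto shared concepts, keep the concepts that occur in a concept subtree, and compare what remains. The synonym table, the subtree, the function names, and the use of cosine similarity are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import Dict, List, Set

# Hypothetical synonym dictionary: surface term -> canonical concept.
SYNONYMS: Dict[str, str] = {
    "car": "automobile",
    "auto": "automobile",
    "lorry": "truck",
}

# Hypothetical concept subtree, stored as parent -> list of children.
SUBTREE: Dict[str, List[str]] = {
    "vehicle": ["automobile", "truck"],
    "automobile": [],
    "truck": [],
}


def canonical(term: str) -> str:
    """Map a term to its canonical concept (the term itself if unknown)."""
    key = term.lower()
    return SYNONYMS.get(key, key)


def subtree_nodes(tree: Dict[str, List[str]], root: str) -> Set[str]:
    """Collect every node reachable from root with an iterative traversal."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(tree.get(node, []))
    return seen


def matched_concepts(vector: Dict[str, float], nodes: Set[str]) -> Dict[str, float]:
    """Sum term weights per canonical concept, keeping only subtree concepts."""
    matched: Dict[str, float] = {}
    for term, weight in vector.items():
        concept = canonical(term)
        if concept in nodes:
            matched[concept] = matched.get(concept, 0.0) + weight
    return matched


def similarity(vec_a: Dict[str, float], vec_b: Dict[str, float],
               tree: Dict[str, List[str]], root: str) -> float:
    """Cosine similarity over the concepts both vectors match in the subtree."""
    nodes = subtree_nodes(tree, root)
    a = matched_concepts(vec_a, nodes)
    b = matched_concepts(vec_b, nodes)
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch, `similarity({"car": 1.0, "truck": 1.0}, {"auto": 1.0, "lorry": 1.0}, SUBTREE, "vehicle")` treats the two vectors as equivalent even though they share no surface terms, which is the kind of synonym coverage the abstract refers to. Restricting the comparison to subtree concepts also keeps the compared vectors small.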

Received: 2013-05-06

CLC number: TP 311

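Funding: Supported by the National Natural Science Foundation of China (No. 61201252, 61203173), the China Postdoctoral Science Foundation (No. 2013M531528), the Natural Science Foundation of Anhui Province (No. 1308085MF100), the Key Natural Science Research Project of Anhui Provincial Universities (No. KJ2011A128), and the Soft Science Research Program of the Anhui Provincial Department of Science and Technology (No. 11020503009).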

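About the authors: ZHANG Pei-Yun (corresponding author), female, born in 1974, Ph.D., associate professor. Her research interests include intelligent information processing, services computing, and the Semantic Web. E-mail: zpyustc@ustc.edu.cn. CHEN Chuan-Ming, male, born in 1981, lecturer, Ph.D. candidate. His research interests include data mining. HUANG Bo, male, born in 1980, Ph.D., associate professor. His research interests include computer network technology and intelligent information processing.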

Cite this article:

ZHANG Pei-Yun, CHEN Chuan-Ming, HUANG Bo. Texts Similarity Algorithm Based on Subtrees Matching [J]. Pattern Recognition and Artificial Intelligence, 2014, 27(3): 226-234.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2014/V27/I3/226
