Semantic Edit Distance between Two Directed Labeled and Rooted Trees

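Abstract (translated from Chinese): The tree edit distance (TED) between directed labeled rooted trees is widely used to compute the structural similarity of documents. This paper proposes the concept of the tree semantic edit distance (TSED) between directed labeled rooted trees and gives a formula for computing it. TED and TSED are combined into a single distance measure, which is applied to the structural clustering of XML documents. Experiments show that this distance model clearly outperforms clustering based on TED alone in both precision and recall. The time complexity of the algorithm equals that of the best dynamic-programming algorithm for computing TED.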

Keywords: tree edit distance, document clustering, structural similarity, semantic similarity

Abstract: In graph theory, the tree edit distance (TED) between two directed, labeled, rooted trees is a widely studied problem. Computing TED is a combinatorial optimization problem, and it is widely used to detect structural similarity between semi-structured documents. This paper proposes a concept named tree semantic edit distance (TSED), together with a formula for computing it, and then presents a distance measure based on both TED and TSED. The proposed distance is applied to clustering the document object model (DOM) trees of extensible markup language (XML) documents. Experimental results show that the proposed measure outperforms measures that use TED alone in terms of clustering precision and recall. The time complexity of the proposed algorithm is the same as that of dynamic-programming algorithms for TED.
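The abstract describes TED computed by dynamic programming and a measure that combines it with a semantic distance. The following is a minimal Python sketch of that general idea, not the paper's algorithm: it uses the standard rightmost-root forest recursion for ordered trees with memoization rather than an optimized TED algorithm, and the paper's TSED formula is not reproduced here. The names `tree_edit_distance`, `semantic_cost`, `combined_distance`, the synonym table, and the weight `alpha` are all hypothetical, introduced only for illustration.

```python
from functools import lru_cache

# A tree is a nested tuple (label, children); children is a tuple of trees.
# A forest is a tuple of trees, ordered left to right.

def leaf(label):
    return (label, ())

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

def tree_edit_distance(t1, t2, relabel_cost=None):
    """Ordered tree edit distance; insert and delete cost 1 each."""
    if relabel_cost is None:
        relabel_cost = lambda a, b: 0 if a == b else 1

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        if not f1:
            return size(f2)            # insert every remaining node
        if not f2:
            return size(f1)            # delete every remaining node
        (l1, c1), (l2, c2) = f1[-1], f2[-1]
        rest1, rest2 = f1[:-1], f2[:-1]
        return min(
            dist(rest1 + c1, f2) + 1,  # delete the rightmost root of f1
            dist(f1, rest2 + c2) + 1,  # insert the rightmost root of f2
            # map the two rightmost roots onto each other
            dist(c1, c2) + dist(rest1, rest2) + relabel_cost(l1, l2),
        )

    return dist((t1,), (t2,))

# Hypothetical label similarity: synonyms are cheap to relabel.
SYNONYMS = {frozenset(("author", "writer"))}

def semantic_cost(a, b):
    if a == b:
        return 0
    return 0.2 if frozenset((a, b)) in SYNONYMS else 1

def combined_distance(t1, t2, alpha=0.5):
    """Assumed weighted sum of structural and semantic distances."""
    structural = tree_edit_distance(t1, t2)
    semantic = tree_edit_distance(t1, t2, semantic_cost)
    return alpha * structural + (1 - alpha) * semantic

doc_a = ("book", (leaf("title"), leaf("author")))
doc_b = ("book", (leaf("title"), leaf("writer")))
print(tree_edit_distance(doc_a, doc_b))
print(combined_distance(doc_a, doc_b))
```

With unit costs the two example trees differ by one relabel, while the semantic cost treats `author` and `writer` as near-equivalent, so the combined value falls between the two. A weighted sum is only one possible way to combine the distances; the paper defines its own measure.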

Key words: Tree Edit Distance, Document Clustering, Structural Similarity, Semantic Similarity

Received: 2010-06-17

CLC number: TP391.4

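Funding: Supported by the National Natural Science Foundation of China (No. 60970047), the China Postdoctoral Science Foundation (No. 20100471503), the Natural Science Foundation of Shandong Province (No. Y2008G19), and the Key Science and Technology Program of Shandong Province (No. 2007GG10001002, 2008GG10001026).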

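About the authors: KANG Qi, male, born in 1986, master's degree; main research interest: structured information retrieval. E-mail: kangqi_sdu@hotmail.com. MA Jun, male, born in 1956, professor and doctoral supervisor; main research interests: information retrieval and parallel computing. E-mail: majun@sdu.edu.cn.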

Cite this article:

KANG Qi, MA Jun. Semantic Edit Distance between Two Directed Labeled and Rooted Trees. Pattern Recognition and Artificial Intelligence, 2011, 24(6): 816-824.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2011/V24/I6/816
