Abstract:In graph theory, the tree edit distance (TED) between two directed labeled and rooted trees is a popular research issue. As a combination optimization problem, calculating TED is widely used in the detection of the structural similarity of semi-structural documents. In this paper, a concept named tree semantic edit distance (TSED) with the corresponding formula is proposed. Then a distance measure based on both TED and TSED is presented. The proposed distance is applied in clustering the document object model (DOM) trees of extensible markup language (XML) documents. Experimental results show the proposed measure is better than those used TED only in terms of clustering precision and recall. The time complexity of the proposed algorithm is the same as those of algorithms for TED based on dynamic programming.
康琪,马军. 有向标记根树之间的语义编辑距离[J]. 模式识别与人工智能, 2011, 24(6): 816-824.
KANG Qi, MA Jun. Semantic Edit Distance between Two Directed Labeled and Rooted Trees. , 2011, 24(6): 816-824.
[1] Fiesca S, Manco G, Masciari E, et al. Fast Detection of XML Structural Similarity. IEEE Trans on Knowledge and Data Engineering, 2005, 17(2): 160-175 [2] Ma Jun, Yi Yingnan, Tian Tian, et al. Retrieving Digital Artifacts from Digital Libraries Semantically // Proc of the International Conference on Intelligent Computing. Hefei, China, 2005: 340-349 [3] Ma Jun, Hemmje M. Knowledge Management Support for Cooperative Research // Proc of the 17th World Computer Congress. Montreal, Canada, 2002: 280-284 [4] Ma Jun, Shao Lu.An Optimal Algorithm for Fuzzy Classification Problem. Journal of Software, 2001,12 (4): 578-581 (in Chinese) (马 军,邵 陆.模糊聚类计算的最佳算法.软件学报, 2001, 12(4): 578-581) [5] Lei Jingsheng, Ma Jun, Jin Ting. A Fuzzy Clustering Technology Based on Hierarchical Neural Networks for Web Document. Journal of Computer Research and Development, 2006, 43(10): 1695-1699 (in Chinese) (雷景生,马 军,靳 婷.基于分级神经网络的Web文档模糊聚类技术.计算机研究与发展, 2006, 43(10):1696-1699) [6] Ma Jun, Chen Zhumin, Zhao Yan, et al. Computation of Document Structural Similarity Based on Part-Whole Matching. Pattern Recognition and Artificial Intelligence, 2007, 20(5): 630-635 (in Chinese) (马 军,陈竹敏,赵 嫣,等.基于部分-整体匹配的文档结构相似度计算.模式识别与人工智能, 2007, 20(5): 630-635) [7] Bertinoa E, Guerrinib E, Mesiti M. A Matching Algorithm for Measuring the Structural Similarity between an XML Document and a DOM and Its Applications. Information System, 2004, 29(1): 23-46 [8] Marian A. Detecting Changes in XML Documents // Proc of the 18th International Conference on Data Engineering. San Jose, USA, 2002: 137-146 [9] Buttler B. A Short Survey of Document Structure Similarity Algorithms // Proc of the International Conference on Internet Computing. Las Vegas, USA, 2004: 3-9 [10] Chen Weimin. New Algorithm for Ordered Tree-to-Tree Correction Problem. Journal of Algorithms, 2001, 40(2): 135-158 [11] Shasha D, Zhang Kaizhong. Fast Algorithms for the Unit Cost Editing Distance between Trees. Journal of Algorithms, 1990, 11(4): 135-145 [12] Zhang Kaizhong. Algorithms for the Constrained Editing Problem between Ordered Labeled Trees and Related Problems. Pattern Recognition, 1995, 28(3): 463-474 [13] Zhang Kaizhong, Shasha D. Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal of Computing, 1989, 18(6): 1245-1262 [14] Nieman A, Jagadish H V. Evaluating Structural Similarity in XML Documents // Proc of the 5th International Workshop on the Web and Databases. Madison, USA, 2002: 61-66 [15] Nayak R. Investigating Semantic Measures in XML Clustering // Proc of the IEEE/WIC/ACM International Conference on Web Intelligence. Hong Kong, China, 2006: 1042-1045 [16] Sager T, Bernstein A, Pinzger M, et al. Detecting Similar Java Classes Using Tree Algorithms // Proc of the International Workshop on Mining Software Repositories. Shanghai, China, 2006: 65-71 [17] Tai K. The Tree-to-Tree Correction Problem. Journal of ACM, 1979, 26(3): 422-433 [18] Klein P N. Computing the Edit-Distance between Unrooted Ordered Trees // Proc of the 6th Annual European Symposium on Algorithms. Venice, Italy, 1998: 91-102 [19] Wang Lian, Cheung D, Mamoulis N, et al. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Trans on Knowledge and Data Engineering, 2004, 16(1): 82-96