Computation of Document Structural Similarity Based on PartWhole Matching
MA Jun1, CHEN ZhuMin1, ZHAO Yan1, LEI JingSheng1,2
1.Department of Computer Science and Technology, Shandong University, Jinan 250061 2.College of Information Science and Technology, Hainan University, Haikou 570228
Abstract:Traditional algorithms for document structural similarity (DSS) are based on either tree edit distances or Fourier transformation. New ways for the computation of DSS are presented based on partwhole matching between the tree Q corresponding to the structural question description and the tree T corresponding to structural description of a document. A way is provided to label above trees by strings then the DSS between two trees is calculated based on string matching operations. Experimental results show the proposed algorithms are better than those based on tree edit distance in terms of recall and precision.
马军,陈竹敏,赵嫣,雷景生. 基于部分整体匹配的文档结构相似度计算*[J]. 模式识别与人工智能, 2007, 20(5): 630-635.
MA Jun , CHEN ZhuMin , ZHAO Yan , LEI JingSheng. Computation of Document Structural Similarity Based on PartWhole Matching. , 2007, 20(5): 630-635.
[1] BaezaYates R, RibeiroNeto B. Modern Information Retrieval. Milan, Italy: Addison Wesley, 1999 [2] Shasha D, Zhang Kaizhong. Fast Algorithms for the Unit Cost Editing Distance between Trees. Journal of Algorithms, 1990, 11(4): 581621 [3] Nieman A, Jagadish H V. Evaluating Structural Similarity in XML Documents // Proc of the 5th International Workshop on the Web and Database. Madison, USA, 2002: 6166 [4] Fiesca S, Manco G, Masciari E, et al. Fast Detection of XML Structural Similarity. IEEE Trans on Knowledge and Data Engineering, 2005, 17(2): 160175 [5] Bullter D. A Short Survey of Document Structure Similarity Algorithms // Proc of the International Conference on Internet Computing. Las Vegas, USA, 2004: 39 [6] Lu S Y. A TreetoTree Distance and Its Application to Cluster Analysis. IEEE Trans on Pattern Analysis and Machine Intelligence, 1979, 1(2): 219224 [7] Wang Xiaoling, Wen Jirong, Luan Jinfeng, et al. A Method to Query Document Database by Content and Structure. Journal of Software, 2003, 14(5): 976983 (in Chinese) (王晓玲,文继荣,栾金锋,等.一种通过内容和结构查询文档数据库的方法. 软件学报, 2003, 14(5): 976983) [8] Chen Jianjun, de Witt D J, Tian Feng, et al. NiagaraCQ: A Scalable Continuous Query System for Internet Data Base // Proc of the ACM SIGMOD International Conference on Management of Data. Dallas, USA, 2000: 379390 [9] Wang Yuan, de Witt D J, Cai Jinyi. XDiff: An Effective Change Detection Algorithm for XML Documents // Proc of the 19th International Conference on Data Engineering. Bangalore, India, 2003: 519530 [10] Cobena G, Abiteboul S, Marian A. Detecting Changes in XML Documents // Proc of the International Conference on Data Engineering. San Jose, USA, 2002: 4152 [11] Ma Jun, Hemmje M. Knowledge Management Support for Cooperative Research // Proc of the 17th World Computer Congress on Intelligent Information Processing. Montreal, Canada, 2002: 280284 [12] Ma Jun, Yi Yingnan, Tian Tian, et al. Retrieving Digital Artifacts from Digital Libraries Semantically // Proc of the International Conference on Intelligent Computing. Hefei, China, 2005: 340349 [13] Lian Li, Ma Jun, Lei Jingsheng, et al. Subdivision and Nature Analysis of PartWhole Relation. Computer Engineering. 2006, 32(17): 8385 (in Chinese) (连 莉,马 军,雷景生,等.PARTWHOLE关系的细分及性质分析. 计算机工程. 2006, 32(17): 8385) [14] Li Sen, Ma Jun, Zhao Yan, et al. The Study on Automatic Classification of Digital Documents of Scientific Papers. Journal of Shandong University: Natural Science, 2006, 41(3): 8184 (in Chinese) (李 森,马 军,赵 嫣,等.对数字化科技论文的自动分类研究. 山东大学学报: 理学版, 2006, 41(3): 8184) [15] Lei Jingsheng, Ma Jun, Jin Ting. A Fuzzy Clustering Technology Based on Hierarchical Neural Networks for Web Document. Journal of Computer Research and Development, 2006, 43(10): 16951699 (in Chinese) (雷景生,马 军,靳 婷.基于分级神经网络的Web文档模糊聚类技术.计算机研究与发展, 2006, 43(10): 16951699) [16] Song Ling, Ma Jun, Lian Li, et al. The Study on the Comprehensive Computation of the Documents Similarity. Computer Engineering and Applications, 2006, 42(30): 160164 (in Chinese) (宋 玲,马 军,连 莉,等.文档相似度综合计算研究.计算机工程与应用, 2006, 42(30): 160164) [17] Song Ling, Ma Jun, Lian Li, et al. Fuzzy Similarity from Conceptual Relations // Proc of the IEEE AsiaPacific Conference on Services Computing. Guangzhou, China, 2006: 310 [18] Knuth D E, Morries J H, Pratt V R. Fast Pattern Matching in Strings. SIAM Journal on Computing, 1977, 6(1): 323350 [19] Hall P A V, Dowling G R. Approximate String Matching. ACM Computing Surveys, 1980, 12(4): 381402 [20] Landou G M, Vishkin U. Introducing Efficient Parallism into Approximate String Matching and a New Serial Algorithm // Proc of the 18th Annual ACM Symposium on Theory of Computing. San Jose, USA, 1986: 220230 [21] ACM SIGMOD Record [OB/OL]. [2006-09-16]. http://www.acm.org/sigmod/record/xml [22] W3SCHOOLS XML Record [OB/OL]. [2006-09-16]. http://www.w3schools.com/xml