Abstract:Hierarchical clustering algorithm is applied in many research fields such as data mining and machine learning. Most existing hierarchical clustering algorithms are dependent on symmetrical distances definition. In this paper, a hierarchical clustering algorithm is proposed based on asymmetric distance. With respect to asymmetric distance characteristics, a selective factor and corresponding calculation formula are proposed. The single linkage, full linkage and average linkage algorithms for the asymmetric hierarchical clustering algorithm are implemented. The hot tags from main social bookmarking systems are extracted and an asymmetric distance is defined based on co-occurrence frequency of different tags. The experimental results show that the proposed algorithm outperforms the clustering algorithm based on symmetrical distance. The cophenetic coefficient is also used to evaluate effectiveness of the algorithm.
[1] Gower J C. The Analysis of Asymmetry and Orthogonality // Barra J R, Brodeau F, Romer G, et al ., eds. Recent Developments in Statistics. Amsterdam, Netherlands: North-Holland Pubhishing Co. 1977: 109-123 [2] Yadohisa H. Formulation of Asymmetric Agglomerative Clustering and Graphical Representation of its Results. Bulletin of the Computational Statistics of Japan, 2002, 15(2): 309-316 [3] Takeuchi A, Saito T, Yadohisa H. Asymmetric Agglomerative Hierarchical Clustering Algorithms and their Evaluations. Journal of Classification, 2007, 24(1): 123-143 [4] Zielman B, Heiser W J. Models for Asymmetric Proximities. British Journal of Mathematical and Statistical Psychology, 1996, 49(1): 127-146 [5]Saito T, Ydohisa H. Data Analysis of Asymmetric Structures: Advanced Approaches in Computational Statistics. New York, USA: Marcel Dekker, 2005 [6] Krishna K, Krishnapuram R. A Clustering Algorithm for Asymmetrically Related Data with Applications to Text Mining // Proc of the 10th International Conference on Information and Knowledge Management. Atlanta, USA, 2001: 571-573 [7] Szekely G J, Rizzo M L. Hierarchical Clustering via Joint between-within Distances: Extending Ward′s Minimum Variance Method. Journal of Classification, 2005, 22(2): 151-183 [8] Fernández A, Gómez S. Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms. Journal of Classification, 2008, 25(1): 43-65 [9] Halkidi M, Batistakis Y, Vazirgiannis M. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 2001, 17(2/3): 107-145 [10] Takumi S, Miyamoto S. Agglomerative Clustering Using Asymmetric Similarities // Proc of the 8th International Conference on Modeling Decision for Artificial Intelligence. Changsha, China, 2011, 114-125 [11] Takumi S, Miyamoto S. A Family of Methods for Asymmetric Nearest Neighbor Clustering // Proc of the Analysis and Modeling of Complex Data in Behavioural and Social Sciences. Capri Island, Italy, 2012: 80-84 [12] Li Y Y, Shi H Z, Jiao L C, et al. Quantum-Inspired Evolutionary Clustering Algorithm Based on Manifold Distance. Acta Electronica Sinica, 2011, 39(10): 2343-2347 (in Chinese) (李阳阳,石洪竺,焦李成,等.基于流形距离的量子进化聚类算法.电子学报, 2011, 39(10): 2343-2347) [13] Wu Y H, Wang X L, Ding Y X, et al. Adaptive On-Line Web Topic Detection Method for Web News Recommendation System. Acta Eelectronica Sinica, 2010, 38(11): 2620-2624 (in Chinese) (吴永辉,王晓龙,丁宇新,等.基于主题的自适应、在线网络热点发现方法及新闻推荐系统.电子学报, 2010, 38(11): 2620-2624) [14] Lu R, Xiang L, Liu M R, et al. Discovering News Topics from Microblogs Based on Hidden Topics Analysis and Text Clustering. Pattern Recognition and Artificial Intelligence, 2012, 25(3): 382-387 (in Chinese) (路 荣,项 亮,刘明荣,等.基于隐主题分析和文本聚类的微博客中新闻话题的发现.模式识别与人工智能, 2012, 25(3): 382-387) [15] Sun G L, Wang X L, Liu B Q, et al. Statistical Chinese Chunking Model Based on Word Clustering Features. Acta Electronica Sinica, 2008, 36(12): 2450-2453 (in Chinese) (孙广路,王晓龙,刘秉权,等.基于词聚类特征的统计中文组块分析模型.电子学报, 2008, 36(12): 2450-2453) [16] Gan W Y, Li D Y, Wang J M. An Hierarchical Clustering Method Based on Data Fields. Acta Electronica Sinica, 2006, 34(2): 258-262 (in Chinese) (淦文燕,李德毅,王建民.一种基于数据场的层次聚类方法.电子学报, 2006, 34(2): 258-262) [17] Huang C H, Yin J, Hou F. A Text Similarity Measurement Combining Word Semantic Information with TF-IDF Method. Chinese Journal of Computers, 2011, 34(5): 856-864 (in Chinese) (黄承慧,印 鉴,侯 昉.一种结合词项语义信息和 TF-IDF 方法的文本相似度量方法.计算机学报, 2011, 34(5): 856-864) [18] Ceng Y L, Xu H B, Wu G W, et al. Clustering Algorithm Based on the Distributions of Intrinsic Clusters. Journal of Software, 2010, 21(11): 2802-2813 (in Chinese) (曾依灵,许洪波,吴高巍,等.一种基于语料特性的聚类算法.软件学报, 2010, 21(11):2802-2813) [19] Huang J B, Sun H L, Dustin B, et al. Mining Hierarchical Community Structure within Networks from Density-Connected Traveling Orders. Journal of Software, 2011, 22(5): 951-961 (in Chinese) (黄健斌,孙鹤立,Dustin B,等.从链接密度遍历序列中挖掘网络社团的层次结构.软件学报, 2011, 22(5): 951-961)