一种非对称距离下的层次聚类算法<sup>*</sup>

摘要
图/表
参考文献
相关文章 (15)

全文: PDF (1260 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要层次聚类算法在数据挖掘领域有着广泛应用，现有的层次聚类算法都依赖于对称距离定义.针对聚类对象的非对称距离下的层次聚类展开研究，提出完整的非对称距离下的层次聚类算法，给出聚类对象选择因子，并定义相应的计算方法.文中提出不同簇之间的合并方法，形成非对称距离下的单连接、全连接等算法.采集社会化书签系统中的热点标签，基于共现次数定义非对称距离，对所提出的算法进行大量实验，实验结果表明聚类结果与实际结果具有较高的一致性.对算法进行量化指标分析的结果也表明非对称层次聚类算法具有良好性能.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	韩忠明
	陈妮
	张慧
	杨伟杰

关键词 ：非对称距离, 层次聚类, 数据挖掘

Abstract：Hierarchical clustering algorithm is applied in many research fields such as data mining and machine learning. Most existing hierarchical clustering algorithms are dependent on symmetrical distances definition. In this paper, a hierarchical clustering algorithm is proposed based on asymmetric distance. With respect to asymmetric distance characteristics, a selective factor and corresponding calculation formula are proposed. The single linkage, full linkage and average linkage algorithms for the asymmetric hierarchical clustering algorithm are implemented. The hot tags from main social bookmarking systems are extracted and an asymmetric distance is defined based on co-occurrence frequency of different tags. The experimental results show that the proposed algorithm outperforms the clustering algorithm based on symmetrical distance. The cophenetic coefficient is also used to evaluate effectiveness of the algorithm.

Key words： Asymmetric Distance Hierarchical Clustering Data Mining

收稿日期: 2013-04-12

ZTFLH:

TP 391

基金资助:国家自然科学基金项目(No.61170112)、北京市属高等学校科学技术与研究生教育创新工程建设项目(No.PXM2012_014213_000037)资助

作者简介: 韩忠明(通讯作者)，男，1972年生，博士，副教授，主要研究方向为互联网数据分析与挖掘、海量数据处理等.E-mail:hanzm@th.btbu.edu.cn.陈妮，女，1987年生，硕士研究生，主要研究方向为互联网数据分析与挖掘等.张慧，女，1989年生，硕士研究生，主要研究方向为互联网数据分析与挖掘等.杨伟杰，女，1980年生，博士，讲师，主要研究方向为互联网数据分析与挖掘等.

引用本文:

韩忠明，陈妮，张慧，杨伟杰. 一种非对称距离下的层次聚类算法^*[J]. 模式识别与人工智能, 2014, 27(5): 410-416. HAN Zhong-Ming,CHEN Ni,ZHANG Hui,YANG Wei-Jie. A Hierarchical Clustering Algorithm Based on Asymmetric Distance. , 2014, 27(5): 410-416.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/ 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2014/V27/I5/410

[1] Gower J C. The Analysis of Asymmetry and Orthogonality // Barra J R, Brodeau F, Romer G, et al ., eds. Recent Developments in Statistics. Amsterdam, Netherlands: North-Holland Pubhishing Co. 1977: 109-123
[2] Yadohisa H. Formulation of Asymmetric Agglomerative Clustering and Graphical Representation of its Results. Bulletin of the Computational Statistics of Japan, 2002, 15(2): 309-316
[3] Takeuchi A, Saito T, Yadohisa H. Asymmetric Agglomerative Hierarchical Clustering Algorithms and their Evaluations. Journal of Classification, 2007, 24(1): 123-143
[4] Zielman B, Heiser W J. Models for Asymmetric Proximities. British Journal of Mathematical and Statistical Psychology, 1996, 49(1): 127-146
[5]Saito T, Ydohisa H. Data Analysis of Asymmetric Structures: Advanced Approaches in Computational Statistics. New York, USA: Marcel Dekker, 2005
[6] Krishna K, Krishnapuram R. A Clustering Algorithm for Asymmetrically Related Data with Applications to Text Mining // Proc of the 10th International Conference on Information and Knowledge Management. Atlanta, USA, 2001: 571-573
[7] Szekely G J, Rizzo M L. Hierarchical Clustering via Joint between-within Distances: Extending Ward′s Minimum Variance Method. Journal of Classification, 2005, 22(2): 151-183
[8] Fernández A, Gómez S. Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms. Journal of Classification, 2008, 25(1): 43-65
[9] Halkidi M, Batistakis Y, Vazirgiannis M. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 2001, 17(2/3): 107-145
[10] Takumi S, Miyamoto S. Agglomerative Clustering Using Asymmetric Similarities // Proc of the 8th International Conference on Modeling Decision for Artificial Intelligence. Changsha, China, 2011, 114-125
[11] Takumi S, Miyamoto S. A Family of Methods for Asymmetric Nearest Neighbor Clustering // Proc of the Analysis and Modeling of Complex Data in Behavioural and Social Sciences. Capri Island, Italy, 2012: 80-84
[12] Li Y Y, Shi H Z, Jiao L C, et al. Quantum-Inspired Evolutionary Clustering Algorithm Based on Manifold Distance. Acta Electronica Sinica, 2011, 39(10): 2343-2347 (in Chinese)
(李阳阳,石洪竺,焦李成,等.基于流形距离的量子进化聚类算法.电子学报, 2011, 39(10): 2343-2347)
[13] Wu Y H, Wang X L, Ding Y X, et al. Adaptive On-Line Web Topic Detection Method for Web News Recommendation System. Acta Eelectronica Sinica, 2010, 38(11): 2620-2624 (in Chinese)
(吴永辉,王晓龙,丁宇新,等.基于主题的自适应、在线网络热点发现方法及新闻推荐系统.电子学报, 2010, 38(11): 2620-2624)
[14] Lu R, Xiang L, Liu M R, et al. Discovering News Topics from Microblogs Based on Hidden Topics Analysis and Text Clustering. Pattern Recognition and Artificial Intelligence, 2012, 25(3): 382-387 (in Chinese)
(路荣,项亮,刘明荣,等.基于隐主题分析和文本聚类的微博客中新闻话题的发现.模式识别与人工智能, 2012, 25(3): 382-387)
[15] Sun G L, Wang X L, Liu B Q, et al. Statistical Chinese Chunking Model Based on Word Clustering Features. Acta Electronica Sinica, 2008, 36(12): 2450-2453 (in Chinese)
(孙广路,王晓龙,刘秉权,等.基于词聚类特征的统计中文组块分析模型.电子学报, 2008, 36(12): 2450-2453)
[16] Gan W Y, Li D Y, Wang J M. An Hierarchical Clustering Method Based on Data Fields. Acta Electronica Sinica, 2006, 34(2): 258-262 (in Chinese)
(淦文燕,李德毅,王建民.一种基于数据场的层次聚类方法.电子学报, 2006, 34(2): 258-262)
[17] Huang C H, Yin J, Hou F. A Text Similarity Measurement Combining Word Semantic Information with TF-IDF Method. Chinese Journal of Computers, 2011, 34(5): 856-864 (in Chinese)
(黄承慧,印鉴,侯昉.一种结合词项语义信息和 TF-IDF 方法的文本相似度量方法.计算机学报, 2011, 34(5): 856-864)
[18] Ceng Y L, Xu H B, Wu G W, et al. Clustering Algorithm Based on the Distributions of Intrinsic Clusters. Journal of Software, 2010, 21(11): 2802-2813 (in Chinese)
(曾依灵,许洪波,吴高巍,等.一种基于语料特性的聚类算法.软件学报, 2010, 21(11):2802-2813)
[19] Huang J B, Sun H L, Dustin B, et al. Mining Hierarchical Community Structure within Networks from Density-Connected Traveling Orders. Journal of Software, 2011, 22(5): 951-961 (in Chinese)
(黄健斌,孙鹤立,Dustin B,等.从链接密度遍历序列中挖掘网络社团的层次结构.软件学报, 2011, 22(5): 951-961)