Clustering Algorithm for Mixed Data Based on Dimensional Frequency Dissimilarity and Strongly Connected Fusion
QIAN Chaokai, HUANG Decai
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310014
Abstract The clustering result of the k-Prototypes algorithm is unpredictable because the algorithm is sensitive to the selection of initial prototypes. Moreover, it ignores the overall dissimilarity between data points and clusters. Therefore, a clustering algorithm based on dimensional frequency dissimilarity and strongly connected fusion is proposed. Multiple rounds of pre-clustering produce a large number of sub-clusters, and strongly connected fusion then generates the final clusters according to the connectivity of those sub-clusters. The proposed algorithm is validated on three different UCI datasets and compared with three mixed data clustering algorithms. The experimental results show that the proposed algorithm yields better clustering precision and purity.
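The two-stage idea in the abstract (repeated pre-clustering, then fusion of connected sub-clusters) can be illustrated with a generic co-association fusion. This is only a minimal sketch and not the authors' strongly connected fusion, whose definition is not given in the abstract. The function name `fuse_partitions`, the `min_agreement` threshold, and the rule that links two points when they share a sub-cluster in enough pre-clustering runs are all assumptions made for illustration.

```python
from itertools import combinations


def fuse_partitions(partitions, min_agreement=1.0):
    """Fuse several pre-clusterings into one partition (hypothetical helper).

    partitions: list of label lists, one per pre-clustering run, all of
        the same length n (labels[i] is the sub-cluster of point i).
    min_agreement: assumed threshold; two points are linked when they fall
        in the same sub-cluster in at least this fraction of the runs.
    Returns a list of n fused cluster labels, numbered in order of first
    appearance. Final clusters are the connected components of the links.
    """
    n = len(partitions[0])
    parent = list(range(n))  # union-find forest over data points

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        together = sum(labels[i] == labels[j] for labels in partitions)
        if together / len(partitions) >= min_agreement:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three pre-clustering runs over five points (made-up labels).
runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
strict = fuse_partitions(runs, min_agreement=1.0)
loose = fuse_partitions(runs, min_agreement=0.6)
```

With the strict threshold only points 0 and 1, which share a sub-cluster in every run, are fused. Lowering the threshold lets points 2, 3 and 4 join one cluster through chained links, which shows how the connectivity criterion controls the granularity of the final clusters.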
Received: 18 August 2014
Fund: Supported by the Special Fund for Scientific Research in the Public Welfare of the Ministry of Water Resources of China (No. 201401044)
About the authors: QIAN Chaokai, born in 1990, Master student. His research interests include data mining. HUANG Decai (corresponding author), born in 1958, Ph.D., Professor. His research interests include data mining and artificial intelligence.