Clustering Algorithm with Kernel Density Estimation
ZHU Jie 1, CHEN Lifei 2
1. Southwest China Institute of Electronic Technology, Chengdu 610036, China; 2. College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117, China
Abstract: Similarity measurement is an important basis of clustering analysis, yet defining an efficient similarity measure for discrete symbols (categories) is difficult. In this paper, a method is proposed to measure the similarity between categories in terms of their kernel probability densities. Unlike the traditional simple-matching and frequency-estimation methods, the proposed measure, through the bandwidth of the kernel function, no longer depends on the assumption that categories on the same attribute are statistically independent. A Bayesian clustering model is then established based on the kernel density estimates of the categories, and a clustering algorithm is derived to optimize the model using a likelihood-based object-to-cluster similarity measure. Finally, three data-driven approaches, based on leave-one-out estimation and maximum likelihood estimation, are proposed to dynamically determine the optimal bandwidths of the kernel function during clustering. Experiments on real-world datasets demonstrate that the proposed algorithm achieves higher clustering accuracy than existing algorithms based on the simple-matching distance or its attribute-weighting variants. The results also show that the bandwidths estimated by the proposed algorithm are of practical significance in applications such as important-feature identification.
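The abstract outlines three computational steps: a kernel density estimate over categories, a bandwidth-controlled similarity, and data-driven bandwidth selection. As a rough illustration of how these pieces fit together, the sketch below assumes the Aitchison-Aitken kernel for unordered categories (a standard kernel in nonparametric estimation for categorical data) together with a simple grid search that maximizes the leave-one-out likelihood; the function names are hypothetical and the code is an interpretation of the abstract, not the paper's algorithm.

```python
import numpy as np

def aa_kernel(u, v, lam, m):
    """Aitchison-Aitken kernel for unordered categories: 1 - lam when the
    two categories match, lam / (m - 1) otherwise, where m is the number of
    distinct categories on the attribute and lam in [0, (m - 1) / m] is the
    bandwidth."""
    return 1.0 - lam if u == v else lam / (m - 1)

def kernel_density(category, column, lam):
    """Kernel probability-density estimate of one category from a column of
    categorical observations."""
    m = len(np.unique(column))
    return np.mean([aa_kernel(category, x, lam, m) for x in column])

def loo_log_likelihood(column, lam):
    """Leave-one-out log-likelihood of a bandwidth: each observation is
    scored against a density estimated from the remaining observations."""
    m = len(np.unique(column))
    ll = 0.0
    for i in range(len(column)):
        rest = np.delete(column, i)
        p = np.mean([aa_kernel(column[i], x, lam, m) for x in rest])
        ll += np.log(max(p, 1e-12))  # clip to avoid log(0) at lam = 0
    return ll

def select_bandwidth(column, n_grid=50):
    """Pick the bandwidth on a grid over [0, (m - 1) / m] that maximizes
    the leave-one-out likelihood (hypothetical grid search, not the
    paper's estimator)."""
    m = len(np.unique(column))
    grid = np.linspace(0.0, (m - 1) / m, n_grid)
    return max(grid, key=lambda lam: loo_log_likelihood(column, lam))

# Toy usage on a single categorical attribute.
col = np.array(list("aabababccacb"))
lam = select_bandwidth(col)
print(lam, kernel_density("a", col, lam))
```

Two boundary cases make the bandwidth interpretable: at lam = 0 the estimate reduces to the frequency-estimation method mentioned in the abstract, while at the upper bound (m - 1)/m it degenerates to a uniform density over the m categories, so the selected bandwidth interpolates between these two extremes.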