Multi-observation I-nice Clustering Algorithm Based on Candidate Centers Fusion
CHEN Hongjie 1,2, HE Yulin 1,2, HUANG Zhexue 1,2, YIN Jianfei 1,2
1. Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060;
2. National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060
With the rapid growth of data scale and compositional complexity in real-world applications, accurately estimating the number and centers of clusters in complex, large-scale data is an important challenge for current clustering algorithms. Accurate estimation of the cluster number and cluster centers is crucial for partially parametric clustering algorithms, for complexity measurement, and for the simplified representation of a dataset. In this paper, building on an in-depth analysis of I-nice, a multi-observation I-nice clustering algorithm based on candidate centers fusion (I-niceCF) is proposed. On top of the original multi-observation projection divide-and-conquer framework, Gaussian mixture models (GMMs) are combined with a coarse-to-fine search strategy for the optimal mixture model to partition the data subsets exactly. In addition, GMM component vectors of candidate centers are constructed from the distances of the candidate centers to each observation point and from the optimal GMMs, and a Minkowski distance pair is designed to measure the dissimilarity between candidate centers. Finally, the candidate centers are fused according to their mixture component vectors. Unlike existing clustering algorithms, I-niceCF jointly optimizes the data subset partitioning of the divide-and-conquer process and the fusion of candidate centers, achieving accurate and efficient estimation even for hundreds of clusters. Experiments on real and synthetic datasets show that I-niceCF estimates the cluster number and cluster centers more accurately, attains higher clustering accuracy, and remains stable under various data scenarios.
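As a rough illustration of the multi-observation projection idea that I-niceCF builds on, the following minimal Python sketch (assuming NumPy and scikit-learn) projects the data onto their distances from a few random observation points, fits one-dimensional GMMs to each projection, and selects the component count with a corrected AIC (AICc). The function names, the exhaustive model scan, and the median consensus are illustrative simplifications of our own, not the authors' coarse-to-fine search or candidate-centers fusion procedure.

```python
# Illustrative sketch only: not the authors' I-niceCF implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm_components(distances, max_k=10):
    """Fit 1-D GMMs with 1..max_k components to a distance vector and
    return the component count minimizing AICc (small-sample corrected AIC)."""
    x = distances.reshape(-1, 1)
    n = len(x)
    best_k, best_aicc = 1, np.inf
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(x)
        p = 3 * k - 1  # free parameters: k means, k variances, k-1 weights
        aicc = gmm.aic(x) + (2 * p * (p + 1)) / max(n - p - 1, 1)
        if aicc < best_aicc:
            best_k, best_aicc = k, aicc
    return best_k

def estimate_cluster_number(X, n_observers=5, max_k=10, seed=None):
    """Project the data onto distances from several random observation
    points and take a consensus of the per-observer component counts."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    counts = []
    for _ in range(n_observers):
        obs = rng.uniform(lo, hi)                # random observation point
        dist = np.linalg.norm(X - obs, axis=1)   # 1-D projection of the data
        counts.append(best_gmm_components(dist, max_k))
    return int(np.median(counts))
```

The appeal of the projection step is that each GMM is fitted in one dimension regardless of the original dimensionality, which keeps model selection cheap even when many observation points and candidate component counts are scanned.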