基于候选中心融合的多观测点I-nice聚类算法

doi:10.16451/j.cnki.issn1003-6059.202204005

Abstract
Figure/Table
References (39)
Related Citation (15)

Download: PDF (1947 KB) HTML (1 KB)
Export: BibTeX | EndNote (RIS)

Abstract

With the rapid growth of data scale and composition complexity in the real-world applications, it is an important challenge for current clustering algorithms to estimate the number and the centers of clusters accurately in processing and analyzing the complex and large-scale data. The accurate estimation of cluster number and cluster centers is crucial for partial parametric clustering algorithm, complexity measurement and simplified representation of dataset. In this paper, grounded on the in-depth analysis of I-nice, a multi-observation I-nice clustering algorithm based on candidate centers fusion(I-niceCF) is proposed. Based on the original multi-observation projection divide-and-conquer framework, Gaussian mixture model(GMM) is combined with the coarse-to-fine optimal mixture model search strategy to partition data subsets exactly. In addition, GMM component vectors of candidate centers are constructed based on the distance of candidate centers from each observation point and optimal GMMs. A Minkowski distance pair is designed to measure the dissimilarity between candidate centers. Finally, the candidate centers are fused based on the mixture component vectors. Different from the existing clustering algorithms, I-niceCF is jointly optimized by data subset partitioning of divide-and-conquer process and candidate centers fusion. Consequently, accurate and efficient estimation for hundreds of clusters is achieved. A series of experiments on real and synthetic datasets show that I-niceCF can estimate cluster number and cluster centers more accurately with higher clustering accuracy and its stability under various data scenarios is verified.

Key words： Unsupervised Learning Observation Point I-nice Parameter-Free Clustering Gaussian Mixture Model

Received: 25 October 2021

ZTFLH:

TP301

Fund:

National Natural Science Foundation of China(No.61972261), Basic Research Foundations of Shenzhen(No.JCYJ20210324093609026,JCYJ20200813091134001)

Corresponding Authors: HE Yulin, Ph.D., associate professor. His research interests include data mining, machine learning, big data processing and analysis, big data approximate computing, and theories and methods of multi-sample statistics.

About author:: CHEN Hongjie, master student. His research interests include data mining, big data processing and analysis, and big data system computing techno-logy. HUANG Zhexue, Ph.D., professor. His research interests include data mining, machine learning, big data processing and analysis, and big data system computing technology. YIN Jianfei, Ph.D., associate professor. His research interests include big data processing and analysis, machine learning, data mining and numerical optimization.

	Service

	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	CHEN Hongjie
	HE Yulin
	HUANG Zhexue
	YIN Jianfei

Cite this article:

CHEN Hongjie,HE Yulin,HUANG Zhexue等. Multi-observation I-nice Clustering Algorithm Based on Candidate Centers Fusion[J]. Pattern Recognition and Artificial Intelligence, 2022, 35(4): 348-362.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202204005 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2022/V35/I4/348

[1] MACQUEEN J. Some Methods for Classification and Analysis of Multivariate Observations // Proc of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, USA: University of California, 1967: 281-297.
[2] NG A Y, JORDAN M I, WEISS Y. On Spectral Clustering: Analysis and an Algorithm // DIETTERICH T G, BECKER S, GHAHRAMANI Z, eds. Advances in Neural Information Processing Systems 14. Cambridge, USA: The MIT Press, 2002: 849-856.
[3] MAITRA R. Initializing Partition-Optimization Algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009, 6(1): 144-157.
[4] HU Q B, XIE S H, LIN S Y, et al. Clustering Embedded Approaches for Efficient Information Network Inference. Data Science and Engineering, 2016, 1: 29-40.
[5] RAY S, TURI R H. Determination of Number of Clusters in k-means Clustering and Application in Colour Image Segmentation // Proc of the 4th International Conference on Advances in Pattern Re-cognition and Digital Techniques. Berlin, Germany: Springer, 1999: 137-143.
[6] DE AMORIM R C, HENNIG C. Recovering the Number of Clusters in Data Sets with Noise Features Using Feature Rescaling Factors. Information Sciences, 2015, 324: 126-145.
[7] HENNIG C, LIAO T F. How to Find an Appropriate Clustering for Mixed-Type Variables with Application to Socio-Economic Stratification. Journal of the Royal Statistical Society(Applied Statistics), 2013, 62(3): 309-369.
[8] FORGEY E. Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification. Biometrics, 1965, 21(3): 768-769.
[9] MILLER J W, HARRISON M T. Mixture Models with a Prior on the Number of Components. Journal of the American Statistical Association, 2018, 113(521): 340-356.
[10] AYED F, CARON F. Nonnegative Bayesian Nonparametric Factor Models with Completely Random Measures for Community Detection. Statistics and Computing, 2019, 31(5). DOI: 10.1007/s11222-021-10037-3.
[11] THORNDIKE R L. Who Belongs in the Family? Psychometrika, 1953, 18(4): 267-276.
[12] ROUSSEEUW P J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 1987, 20: 53-65.
[13] TIBSHIRANI R, WALTHER G, HASTIE T. Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society(Statistical Methodology), 2001, 63(2): 411-423.
[14] DUDOIT S, FRIDLYAND J. A Prediction-Based Resampling Me-thod for Estimating the Number of Clusters in a Dataset. Genome Biology, 2002, 3(7). DOI: 10.1186/gb-2002-3-7-research0036.
[15] SUGAR C A, JAMES G M. Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach. Journal of the American Statistical Association, 2003, 98(463): 750-763.
[16] MOHAJER M, ENGLMEIER K H, SCHMID V J. A Comparison of Gap Statistic Definitions with and without Logarithm Function[C/OL]. [2021-09-26].https://arxiv.org/pdf/1103.4767v1.pdf.
[17] YANG M S, WU K L. A Modified Mountain Clustering Algorithm. Pattern Analysis and Applications, 2005, 8(1): 125-138.
[18] SHARMA K K, SEAL A. Multi-view Spectral Clustering for Uncertain Objects. Information Sciences, 2021, 547: 723-745.
[19] MA Z Z, LAI Y P, XIE J Y, et al. Dirichlet Process Mixture of Generalized Inverted Dirichlet Distributions for Positive Vector Data with Extended Variational Inference. IEEE Transactions on Neural Networks and Learning Systems, 2021. DOI: 10.1109/TNNLS.2021.3072209.
[20] DINARI O, YU A, FREIFELD O, et al. Distributed MCMC Infe-rence in Dirichlet Process Mixture Models Using Julia // Proc of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. Washington, USA: IEEE, 2019: 518-525.
[21] WANG C, PAN S R, HU R Q, et al. Attributed Graph Clus-tering: A Deep Attentional Embedding Approach // Proc of the 28th International Joint Conference on Artificial Intelligence. San Francisco, USA: IJCAI, 2019: 3670-3676.
[22] ZHU X J. Semi-Supervised Learning Literature Survey. Technical Report, 1530. Madison, USA: University of Wisconsin-Madison, 2005.
[23] ESTER M, KRIEGEL H P, SANDER J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise // Proc of the 2nd International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 1996: 226-231.
[24] RODRIGUEZ A, LAIO A. Clustering by Fast Search and Find of Density Peaks. Science, 2014, 344(6191): 1492-1496.
[25] MASUD M A, HUANG J Z, WEI C H, et al. I-nice: A New Approach for Identifying the Number of Clusters and Initial Cluster Centres. Information Sciences, 2018, 466: 129-151.
[26] HURVICH C M, TSAI C L. Regression and Time Series Model Selection in Small Samples. Biometrika, 1989, 76(2): 297-307.
[27] SUGIURA N. Further Analysts of the Data by Akaike's Information Criterion and the Finite Corrections: Further Analysts of the Data by Akaike's. Communications in Statistics-Theory and Methods, 1978, 7(1): 13-26.
[28] BECKMANN N, KRIEGEL H P, SCHNEIDER R, et al. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. ACM SIGMOD Record, 1990, 19(2): 322-331.
[29] ANKERST M, BREUNIG M M, KRIEGEL H P, et al. OPTICS: Ordering Points to Identify the Clustering Structure. ACM Sigmod Record, 1999, 28(2): 49-60.
[30] XU X W, ESTER M, KRIEGEL H P, et al. A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases // Proc of the 14th International Conference on Data Engineering. Wa-shington, USA: IEEE, 1998: 324-331.
[31] KAUFMAN L, ROUSSEEUW P J. Finding Groups in Data: An Introduction to Cluster Analysis. New York, USA: John Wiley & Sons, 2009.
[32] ZHANG T, RAMAKRISHNAN R, LIVNY M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM Sigmod Record, 1996, 25(2): 103-114.
[33] HULL J J. A Database for Handwritten Text Recognition Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994, 16(5): 550-554.
[34] HUBERT L, ARABIE P. Comparing Partitions. Journal of Classification, 1985, 2(1): 193-218.
[35] VINH N X, EPPS J, BAILEY J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research, 2010, 11: 2837-2854.
[36] DUNN J C. A Graph Theoretic Analysis of Pattern Classification via Tamura's Fuzzy Relation. IEEE Transactions on Systems, Man, and Cybernetics, 1974, 4(3): 310-313.
[37] BEZDEK J C. Objective Function Clustering // BEZDEK J C, ed. Pattern Recognition with Fuzzy Objective Function Algorithms. Berlin, Germany: Springer, 1981: 43-93.
[38] WANG S L, LI Q, ZHAO C F, et al. Extreme Clustering-A Clustering Method via Density Extreme Points. Information Sciences, 2021, 542: 24-39.
[39] HE Y L, WU Y Y, QIN H L, et al. Improved I-nice Clustering Algorithm Based on Density Peaks Mechanism. Information Sciences, 2021, 548: 177-190.