Multi-observation I-nice Clustering Algorithm Based on Candidate Centers Fusion
CHEN Hongjie 1,2, HE Yulin 1,2, HUANG Zhexue 1,2, YIN Jianfei 1,2
1. Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060;
2. National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060
With the rapid growth of data scale and compositional complexity in real-world applications, accurately estimating the number and centers of clusters in complex, large-scale data is an important challenge for current clustering algorithms. Accurate estimation of the cluster number and cluster centers is crucial for partially parametric clustering algorithms, for complexity measurement, and for the simplified representation of a dataset. In this paper, building on an in-depth analysis of I-nice, a multi-observation I-nice clustering algorithm based on candidate centers fusion (I-niceCF) is proposed. On top of the original multi-observation projection divide-and-conquer framework, Gaussian mixture models (GMMs) are combined with a coarse-to-fine search strategy for the optimal mixture model to partition the data subsets exactly. In addition, GMM component vectors of candidate centers are constructed from the distances of the candidate centers to each observation point and from the optimal GMMs, and a Minkowski distance pair is designed to measure the dissimilarity between candidate centers. Finally, the candidate centers are fused according to their mixture component vectors. Unlike existing clustering algorithms, I-niceCF jointly optimizes the data subset partitioning of the divide-and-conquer process and the fusion of candidate centers, achieving accurate and efficient estimation even for hundreds of clusters. Experiments on real and synthetic datasets show that I-niceCF estimates the cluster number and cluster centers more accurately, attains higher clustering accuracy, and remains stable under various data scenarios.
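As a rough illustration of the multi-observation projection idea that I-niceCF builds on, the following minimal Python sketch (assuming NumPy and scikit-learn) projects the data onto their distances from a few random observation points, fits one-dimensional GMMs to each projection, and selects the component count with a corrected AIC (AICc). The function names, the exhaustive model scan, and the median consensus are illustrative simplifications of our own, not the authors' coarse-to-fine search or candidate-centers fusion procedure.

```python
# Illustrative sketch only: not the authors' I-niceCF implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm_components(distances, max_k=10):
    """Fit 1-D GMMs with 1..max_k components to a distance vector and
    return the component count minimizing AICc (small-sample corrected AIC)."""
    x = distances.reshape(-1, 1)
    n = len(x)
    best_k, best_aicc = 1, np.inf
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(x)
        p = 3 * k - 1  # free parameters: k means, k variances, k-1 weights
        aicc = gmm.aic(x) + (2 * p * (p + 1)) / max(n - p - 1, 1)
        if aicc < best_aicc:
            best_k, best_aicc = k, aicc
    return best_k

def estimate_cluster_number(X, n_observers=5, max_k=10, seed=None):
    """Project the data onto distances from several random observation
    points and take a consensus of the per-observer component counts."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    counts = []
    for _ in range(n_observers):
        obs = rng.uniform(lo, hi)                # random observation point
        dist = np.linalg.norm(X - obs, axis=1)   # 1-D projection of the data
        counts.append(best_gmm_components(dist, max_k))
    return int(np.median(counts))
```

The appeal of the projection step is that each GMM is fitted in one dimension regardless of the original dimensionality, which keeps model selection cheap even when many observation points and candidate component counts are scanned.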