Research Progress of Deep Clustering Based on Unsupervised Representation Learning
HOU Haiwei1, DING Shifei1,2, XU Xiao1,2
1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116; 2. Engineering Research Center of Mine Digitization of Ministry of Education, China University of Mining and Technology, Xuzhou 221116
Abstract: In the era of big data, data are typically large in scale, high in dimension, and complex in structure. Deep clustering uses deep learning to combine representation learning with the clustering task, which greatly improves clustering performance on large-scale, high-dimensional data. However, the development of deep clustering has rarely been summarized from the perspective of representation learning, and both the differences between traditional and deep clustering algorithms and the differences among deep clustering algorithms themselves have seldom been analyzed. First, the clustering algorithms commonly used in deep clustering are summarized. Deep clustering algorithms are then divided into those based on generative models and those based on discriminative models, and the representation learning process of deep models in clustering tasks is analyzed. Second, multiple types of algorithms are compared experimentally, and their advantages and disadvantages are summarized to guide model selection for specific tasks. Finally, application scenarios are described and future development trends of deep clustering are discussed.