Overlapping Subspace Clustering Based on Probabilistic Model
QIU Yunfei1,2, FEI Bowen2, LIU Daqian3
1.School of Software, Liaoning Technical University, Huludao 125105 2.School of Business Administration, Liaoning Technical University, Huludao 125105 3.School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105
Abstract:Due to the low clustering accuracy of the existing subspace clustering methods in dealing with the problem of overlapping clusters, an overlapping subspace clustering algorithm based on probability model(OSCPM) is proposed. Firstly, the high-dimensional data is divided into several subspaces by using the subspace representation of mixed-norm. Then, a probability model of the exponential family distribution is used to determine the overlapping part of the clusters in the subspace, and the data is assigned to the correct class clusters to get the clustering results. An alternating maximization method is used to determine the optimal solution of the objective function in the process of parameter estimation. Experimental results on artificial datasets and UCI datasets show that OSCPM produces better clustering performance compared with other algorithms and it is suitable for large scale datasets.
[1] SHARMA V K, BALA A. Clustering for High Dimensional Data // Proc of the 1st IEEE International Conference on Networks and Soft Computing. Washington, USA: IEEE, 2014: 365-369. [2] 朱 林,雷景生,毕忠勤,等.一种基于数据流的软子空间聚类算法.软件学报, 2013, 24(11): 2610-2627. (ZHU L, LEI J S, BI Z Q, et al. Soft Subspace Clustering for Streaming Data. Journal of Software, 2013, 24(11): 2610- 2627.) [3] ELHAMIFAR E, VIDAL R. Sparsity in Unions of Subspaces for Classification and Clustering of High-Dimensional Data // Proc of the 49th Annual Allerton Conference on Communication, Control, and Computing. Washington, USA: IEEE, 2011: 1085-1089. [4] BOUVEYRON C, GIRARD S, SCHMID C. High-Dimensional Data Clustering. Computational Statistics and Data Analysis, 2007, 52(1): 502-519. [5] TSAI C F, HUANG S C. An Effective and Efficient Grid-Based Data Clustering Algorithm Using Intuitive Neighbor Relationship for Data Mining // Proc of the International Conference on Machine Lear-ning and Cybernetics. Washington, USA: IEEE, 2015: 478-483. [6] 王 骏,王士同,邓赵红.特征加权距离与软子空间学习相结合的文本聚类新方法.计算机学报, 2012, 35(8): 1655-1665. (WANG J, WANG S T, DENG Z H. A Novel Text Clustering Algorithm Based on Feature Weighting Distance and Soft Subspace Learning. Chinese Journal of Computers, 2012, 35(8): 1655-1665.) [7] AGRAWAL R, GEHERKE J, GUNOPULOS D. et al. Automatic Subspace Clustering of High Dimensional Data. Data Mining and Knowledge Discovery, 2005, 11(1): 5-33. [8] CHU Y H, HUANG J W, CHUANG K T, et al. Density Conscious Subspace Clustering for High-Dimensional Data. IEEE Transaction on Knowledge and Data Engineering, 2010, 22(1): 16-30. [9] ELHAMIFAR E, VIDAL R. Sparse Subspace Clustering: Algorithm, Theory, and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(11): 2765-2781. [10] LU C Y, MIN H, ZHAO Z Q, et al. Robust and Efficient Subspace Segmentation via Least Squares Regression // Proc of the 12th European Conference on Computer Vision. Berlin, Germany: Springer, 2012: 347-360. [11] LIU G C, LIN Z C, YU Y. Robust Subspace Segmentation by Low Rank Representation // Proc of the 27th International Conference on Machine Leaning. Berlin, Germany: Springer, 2010: 663-670. [12] 刘 博,谢博鋆,朱 杰,等.快速可扩展的子空间聚类算法.模式识别与人工智能, 2016, 29(1): 11-21. (LIU B, XIE B J, ZHU J, et al. Fast Scalable Subspace Clus-tering Algorithm. Pattern Recognition and Artificial Intelligence, 2016, 29(1): 11-21.) [13] JING L P, NG M K, HUANG J Z X, et al. An Entropy Weighting K-means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Enginee-ring, 2007, 19(8): 1026-1041. [14] BAADEL S, THABTAH F, LU J. Overlapping Clustering: A Review // Proc of the IEEE SAI Computing Conference. Washington, USA: IEEE, 2016: 233-237. [15] BEZDEK J C. Pattern Recognition with Fuzzy Objective Function Algorithms. New York, USA: Springer, 1981. [16] BANERJEE A, KRUMPELMAN C, GHOSH J, et al. Model-Based Overlapping Clustering // Proc of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, USA: ACM, 2005: 532-537. [17] FU Q, BANERJEE A. Bayesian Overlapping Subspace Clustering // Proc of the 9th IEEE International Conference on Data Mining. Washington, USA: IEEE, 2009: 776-781. [18] 曹振丽,孙瑞志,李 勐.一种基于高斯混合模型的不确定数据流聚类方法.计算机研究与发展, 2014, 51(S2): 102-109. (CAO Z L, SUN R Z, LI M. A Method for Clustering Uncertain Data Streams Based on GMM. Journal of Computer Research and Development, 2014, 51(S2): 102-109.) [19] PANAGAKIS Y, KOTROPOULOS C. Elastic Net Subspace Clus-tering Applied to Pop/Music Structure Analysis. Pattern Recognition Letters, 2014, 38: 46-53. [20] SHI J B, MALIK J. Normalized Cuts and Image Segmentation. IEEE
Transaction on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905. [21] FU Q, BANERJEE A. Multiplication Mixture Models for Overla-pping Clustering // Proc of the 8th IEEE International Conference on Data Mining. Washington, USA: IEEE, 2008: 791-796. [22] 刘展杰,陈晓云.局部子空间聚类.自动化学报, 2016, 42(8): 1238-1247. (LIU Z J, CHEN X Y. Local Subspace Clustering. Acta Automatica Sinica, 2016, 42(8): 1238-1247.) [23] 邱云飞,杨 倩,唐晓亮.基于粒子群优化的软子空间聚类算法.模式识别与人工智能, 2015, 28(10): 903-912. (QIU Y F, YANG Q, TANG X L. Soft Subspace Clustering Based on Particle Swarm Optimization. Pattern Recognition and Artificial Intelligence, 2015, 28(10): 903-912.) [24] AGGARWAL C C, WOLF J L, YU P S, et al. Fast Algorithms for Projected Clustering // Proc of the ACM SIGKDD International Conference on Management of Data. New York, USA: ACM, 1999: 61-72.