An Effective High Attribute Dimensional Sparse Clustering
ZHAO YaQin1, ZHOU XianZhong2, HE Xin1, WANG JianYu1
1. School of Automation, Nanjing University of Science and Technology, Nanjing 210094
2. School of Management and Engineering, Nanjing University, Nanjing 210093
Abstract: Clustering analysis is one of the most important techniques in data mining, and the scale, dimensionality, and sparseness of a dataset are three key factors that influence clustering accuracy. This paper proposes an effective clustering algorithm for sparse data with high attribute dimension. Three notions are defined: sparse similarity, similarity between equivalence relations, and generalized equivalence relation. Based on these definitions, the theory of equivalence relations is applied to form initial clusters. The initial equivalence relations are then modified according to the similarity between pairs of equivalence relations in order to obtain more reasonable clustering results. High dimensional sparse data is effectively compressed and expressed as a sparse feature vector whose dimension is far lower than that of the original data. As a result, the proposed approach can handle a wide range of high dimensional sparse data efficiently, and its results are independent of the order in which the objects are presented.
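The two ideas named in the abstract, compressing sparse objects into short feature vectors and forming initial clusters from an equivalence relation, can be illustrated with a minimal sketch. It is not the paper's algorithm, since the abstract does not state the definitions. It assumes binary attributes, represents each object by the set of indices of its non-zero attributes, groups objects whose sparse vectors are identical, and uses the Jaccard coefficient as a stand-in for sparse similarity. The names `to_sparse`, `equivalence_classes`, and `jaccard`, and the sample data, are hypothetical.

```python
from collections import defaultdict

def to_sparse(row):
    """Compress a dense 0/1 row into the set of indices with non-zero values."""
    return frozenset(i for i, v in enumerate(row) if v)

def equivalence_classes(objects):
    """Group object ids whose sparse vectors are identical (assumed relation)."""
    classes = defaultdict(list)
    for obj_id, features in objects.items():
        classes[features].append(obj_id)
    # Sorting makes the result independent of the order of the input objects.
    return sorted(sorted(members) for members in classes.values())

def jaccard(a, b):
    """Assumed stand-in for sparse similarity: shared / total non-zero attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical data: four objects described by eight binary attributes.
dense = {
    "x1": [1, 0, 0, 1, 0, 0, 0, 0],
    "x2": [1, 0, 0, 1, 0, 0, 0, 0],
    "x3": [0, 0, 1, 0, 0, 0, 0, 1],
    "x4": [1, 0, 0, 0, 0, 0, 0, 0],
}
sparse = {obj_id: to_sparse(row) for obj_id, row in dense.items()}
initial_clusters = equivalence_classes(sparse)
```

In this sketch each object is stored with at most two indices instead of eight values. A refinement step could then merge initial clusters whose similarity exceeds a chosen threshold; that threshold would be a further assumption.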