Abstract: To solve the problems of local optima and low efficiency in the sequential information bottleneck (sIB) algorithm for document clustering, an improved sIB algorithm, SA-isIB, is proposed. Following a reasonable annealing sequence, a certain proportion of documents is selected at random from the initial clustering solution produced by the basic sIB algorithm. The cluster labels of the selected documents are then revised and the solution is optimized iteratively. After the simulated annealing process, document clustering solutions of higher accuracy are obtained. Experimental results on document datasets show that the SA-isIB algorithm effectively improves the clustering accuracy of the sIB algorithm.
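The abstract describes the refinement step only at a high level: perturb the labels of a randomly chosen proportion of documents and accept or reject the change under a cooling schedule. The Python sketch below illustrates such a simulated-annealing refinement loop under stated assumptions; the function name sa_refine, the parameters frac, t0, alpha and steps, and the use of a generic score callable (standing in for the mutual-information objective that sIB maximizes) are illustrative assumptions, not the paper's exact SA-isIB procedure.

```python
import math
import random

def sa_refine(labels, n_clusters, score, frac=0.1, t0=1.0, alpha=0.9, steps=50, seed=0):
    """Simulated-annealing refinement of an initial clustering solution.

    labels     -- initial cluster assignment (e.g. output of the basic sIB algorithm)
    score      -- callable returning the clustering objective to maximize
                  (for sIB this would be the preserved mutual information)
    frac       -- proportion of documents whose labels are revised per step (assumed)
    t0, alpha  -- starting temperature and geometric cooling factor (assumed schedule)
    """
    rng = random.Random(seed)
    current = list(labels)
    best = list(current)
    cur_score = best_score = score(current)
    t = t0
    for _ in range(steps):
        candidate = list(current)
        # select a certain proportion of documents at random and revise their labels
        picked = rng.sample(range(len(candidate)), max(1, int(frac * len(candidate))))
        for i in picked:
            candidate[i] = rng.randrange(n_clusters)
        cand_score = score(candidate)
        delta = cand_score - cur_score
        # accept improvements always; accept worse moves with Boltzmann probability
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, cur_score = candidate, cand_score
            if cur_score > best_score:
                best, best_score = list(current), cur_score
        t *= alpha  # annealing sequence: gradually lower the temperature
    return best, best_score
```

A usage sketch would pass the sIB labels and an objective such as the mutual information between cluster assignments and word distributions, e.g. `refined, value = sa_refine(sib_labels, k, mi_objective)`; the geometric cooling schedule here is a common default, not necessarily the one used in the paper.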