Abstract: Traditional clustering algorithms ignore the potential complementarity between different languages when discovering the hidden structure of a document collection, so the patterns they obtain cannot reflect the latent information in the collection. To address this problem, a multilingual document clustering algorithm based on the parallel information bottleneck (ML-IB) is proposed. First, relevant variables for the information in each language are constructed according to the bag-of-words model. Then, these relevant variables are incorporated into the parallel information bottleneck so that the relevant information between the data patterns and the multiple relevant variables is maximally preserved. Finally, to optimize the ML-IB objective function, a draw-and-merge method based on information theory is proposed, which guarantees that ML-IB converges to a locally optimal solution. Experimental results on multilingual document datasets show that the proposed algorithm significantly outperforms state-of-the-art single-language and multilingual clustering methods.
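The draw-and-merge idea can be illustrated with a minimal sketch. The abstract does not state the ML-IB objective or update rule, so the code below assumes the merge cost of the standard sequential information bottleneck, (p(x) + p(t)) times the Jensen-Shannon divergence between p(y|x) and p(y|t), summed over the language views. The uniform prior p(x), the per-view `weights`, and the names `js_cost` and `draw_and_merge` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def js_cost(p_x, py_x, p_t, py_t):
    """Assumed merge cost: (p(x)+p(t)) * JS divergence of p(y|x) and p(y|t)."""
    w = p_x + p_t
    pi_x, pi_t = p_x / w, p_t / w
    mix = pi_x * py_x + pi_t * py_t

    def kl(p, q):
        m = p > 0  # q > 0 wherever p > 0 because q is a mixture containing p
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    return w * (pi_x * kl(py_x, mix) + pi_t * kl(py_t, mix))

def draw_and_merge(views, n_clusters, weights=None, n_sweeps=10, seed=0):
    """views: list of (n_docs, vocab_size) bag-of-words count matrices,
    one per language; every document needs at least one word in each view."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    weights = weights or [1.0] * len(views)  # hypothetical view weights
    # p(y_v | x) for each view, with p(x) assumed uniform (1/n).
    cond = [v / v.sum(axis=1, keepdims=True) for v in views]
    labels = rng.permutation(n) % n_clusters  # random balanced start
    for _ in range(n_sweeps):
        changed = False
        for x in rng.permutation(n):
            old = labels[x]
            if np.sum(labels == old) == 1:
                continue  # keep every cluster non-empty
            labels[x] = -1  # "draw" x out of its cluster
            costs = np.zeros(n_clusters)
            for w, c in zip(weights, cond):
                for t in range(n_clusters):
                    members = labels == t
                    p_t = members.sum() / n
                    py_t = c[members].mean(axis=0)
                    costs[t] += w * js_cost(1.0 / n, c[x], p_t, py_t)
            new = int(np.argmin(costs))  # "merge" into the cheapest cluster
            labels[x] = new
            changed = changed or new != old
        if not changed:
            break  # a full sweep without moves: local optimum of this sketch
    return labels

# Toy data: six documents, two topics, observed in two languages.
view_a = np.array([[2, 2, 0, 0], [4, 4, 0, 0], [1, 1, 0, 0],
                   [0, 0, 3, 3], [0, 0, 1, 1], [0, 0, 2, 2]], dtype=float)
view_b = np.array([[3, 0, 0], [1, 0, 0], [2, 0, 0],
                   [0, 1, 1], [0, 2, 2], [0, 3, 3]], dtype=float)
labels = draw_and_merge([view_a, view_b], n_clusters=2)
```

Because the drawn document can always be merged back into its previous cluster, a step in this sketch never picks a costlier assignment than the current one, which is the usual argument for convergence of draw-and-merge procedures to a local optimum.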
YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong. Multilingual Documents Clustering Algorithm Based on Parallel Information Bottleneck[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(6): 559-568.