Multilingual Documents Clustering Algorithm Based on Parallel Information Bottleneck

doi:10.16451/j.cnki.issn1003-6059.201706009

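Abstract: When extracting pattern structures from text data, clustering algorithms ignore the potential complementarity among the information carried by multiple languages, so the resulting pattern structures cannot fully reflect the intrinsic information of the data. To address this problem, this paper proposes a multilingual document clustering algorithm based on the parallel information bottleneck. First, the bag-of-words model is used to construct a relevant variable for each language of the text data. Then the multiple relevant variables are introduced into the parallel information bottleneck method; by maximally preserving the information between the pattern structure and the multiple relevant variables, the resulting pattern structure reflects the multiple language information of the data. Finally, an information-theoretic draw-and-merge method is proposed to optimize the objective function of the algorithm and to guarantee its convergence to a local optimum. Experiments show that the algorithm handles the multiple language information of text data effectively and outperforms single-language clustering algorithms as well as two existing classes of clustering algorithms that can handle multilingual text information.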

Keywords: parallel information bottleneck, multilingual, document clustering, information maximization

Abstract: Traditional clustering algorithms ignore the potential complementarity between different languages when discovering the hidden structures in a document collection, so the obtained patterns cannot fully reflect the latent information in the collection. To address this problem, a multilingual document clustering algorithm based on the parallel information bottleneck (ML-IB) is proposed. Firstly, the relevant variables of the multiple language information are constructed according to the bag-of-words model. Then, the multiple relevant variables are incorporated into the parallel information bottleneck, and the relevant information between the data patterns and the multiple relevant variables is maximally preserved. Finally, to optimize the objective function of ML-IB, a draw-and-merge method based on information theory is proposed to guarantee the convergence of ML-IB to a local optimal solution. Extensive experimental results on multilingual document datasets show that the proposed algorithm significantly outperforms state-of-the-art single-language and multilingual clustering methods.
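The three steps named in the abstract (bag-of-words relevant variables, an objective that preserves information about every language, and a draw-and-merge search) can be illustrated with a minimal sketch. This is not the authors' ML-IB implementation. It assumes a hard partition, equal weights on all language views, and a brute-force re-evaluation of the objective in place of an information-theoretic merge cost. The function names `mutual_information`, `objective`, and `draw_and_merge`, and the toy count matrices, are hypothetical.

```python
import math


def mutual_information(counts, labels, n_clusters):
    """I(T;Y) in nats for a hard partition `labels` of the rows of `counts`."""
    total = float(sum(sum(row) for row in counts))
    n_words = len(counts[0])
    # Joint p(t, y): pool the word counts of all documents in each cluster.
    joint = [[0.0] * n_words for _ in range(n_clusters)]
    for row, t in zip(counts, labels):
        for y, c in enumerate(row):
            joint[t][y] += c / total
    p_t = [sum(r) for r in joint]
    p_y = [sum(joint[t][y] for t in range(n_clusters)) for y in range(n_words)]
    mi = 0.0
    for t in range(n_clusters):
        for y in range(n_words):
            if joint[t][y] > 0.0:
                mi += joint[t][y] * math.log(joint[t][y] / (p_t[t] * p_y[y]))
    return mi


def objective(views, labels, n_clusters):
    """Sum of I(T;Y_k) over all language views (equal weights: an assumption)."""
    return sum(mutual_information(v, labels, n_clusters) for v in views)


def draw_and_merge(views, n_clusters, max_sweeps=50):
    """Greedy local search: draw one document, merge it into its best cluster."""
    n_docs = len(views[0])
    labels = [i % n_clusters for i in range(n_docs)]  # round-robin start
    for _ in range(max_sweeps):
        changed = False
        for x in range(n_docs):
            old = labels[x]
            if labels.count(old) == 1:
                continue  # never empty a cluster
            best_t, best_val = old, objective(views, labels, n_clusters)
            for t in range(n_clusters):
                if t == old:
                    continue
                labels[x] = t  # tentatively merge document x into cluster t
                val = objective(views, labels, n_clusters)
                if val > best_val + 1e-12:
                    best_t, best_val = t, val
            labels[x] = best_t
            changed = changed or best_t != old
        if not changed:
            break  # no single move improves the objective: local optimum
    return labels


# Toy bag-of-words counts: 4 documents, each available in two languages.
english = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]
french = [[2, 0], [3, 0], [0, 2], [0, 3]]
labels = draw_and_merge([english, french], n_clusters=2)
print(labels, objective([english, french], labels, 2))
```

Because a move is accepted only when it strictly increases the summed mutual information, the objective never decreases and the search stops at a partition that no single reassignment can improve. On the toy input the first two documents end up in one cluster and the last two in the other.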

Key words: Parallel Information Bottleneck, Multilingual, Document Clustering, Information Maximization

Received: 2016-09-26

CLC number: TP 391.4

Funding: Supported by the National Natural Science Foundation of China (No. 61502434, 61502432, 61170223)

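About the authors: YAN Xiaoqiang (闫小强), male, born in 1989, Ph.D. candidate. His research interests include machine learning, pattern recognition, and computer vision. E-mail: iexqyan@zzu.edu.cn.
LU Yaoen (卢耀恩), male, born in 1989, master's student. His research interests include pattern recognition and data mining. E-mail: ieyelu@zzu.edu.cn.
LOU Zhengzheng (娄铮铮), male, born in 1984, Ph.D., associate professor. His research interests include machine learning, pattern recognition, and data mining. E-mail: iezzlou@zzu.edu.cn.
YE Yangdong (叶阳东, corresponding author), male, born in 1962, Ph.D., professor. His research interests include intelligent systems, databases, and machine learning. E-mail: yeyd@zzu.edu.cn.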

Cite this article:

YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong. Multilingual Documents Clustering Algorithm Based on Parallel Information Bottleneck[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2017, 30(6): 559-568.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201706009 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2017/V30/I6/559
