A Semi-Structured Tibetan Text Clustering Algorithm Based on Swarm Intelligence<sup>*</sup>

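Abstract

This paper applies swarm intelligence to the clustering of semi-structured Tibetan Web texts and proposes a swarm-intelligence-based clustering algorithm for semi-structured Tibetan Web texts (SCAST). Taking full account of how swarm intelligence affects the accuracy and time efficiency of Tibetan text clustering, SCAST first represents Tibetan texts with a vector space model and places the texts and the intelligent ant colony at random in a text vector space. Each intelligent ant then randomly selects a Tibetan text, computes the similarity of that text within the current local area, and derives the probability of picking up or dropping the text, which determines whether it picks up, moves, or drops the text. Finally, through repeated iterative training, the Tibetan texts are gathered according to their similarity to yield the final clustering result. Experiments on a large amount of real Tibetan Web text data show that, compared with the traditional k-means clustering algorithm, the swarm-intelligence-based Tibetan text clustering algorithm improves clustering accuracy by about 8.0% on average.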
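The vector space representation mentioned above can be illustrated with a minimal sketch. The abstract does not state the term weighting or the similarity measure, so raw term frequencies and cosine similarity are assumptions here, and the function names `to_vector` and `cosine_similarity` are hypothetical. Tokens are assumed to be already segmented; Tibetan word segmentation is outside this sketch.

```python
import math
from collections import Counter


def to_vector(tokens):
    """Map already-segmented tokens to a sparse term-frequency vector.

    Assumption: raw term counts; the paper's actual weighting is not
    given in the abstract.
    """
    return Counter(tokens)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Placeholder tokens stand in for segmented Tibetan words.
doc_a = to_vector(["w1", "w2", "w2", "w3"])
doc_b = to_vector(["w2", "w3", "w4"])
doc_c = to_vector(["w5", "w6"])
```

Texts that share no terms, such as `doc_a` and `doc_c`, get similarity 0, while a text compared with itself gets similarity 1.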

Keywords: swarm intelligence, Tibetan, clustering analysis, swarm similarity

Abstract:

To apply swarm intelligence techniques to the clustering of semi-structured Tibetan Web texts, a semi-structured Tibetan text clustering algorithm based on swarm intelligence (SCAST) is proposed. Taking full account of the accuracy and efficiency of Tibetan text clustering, SCAST uses a vector space model to represent Tibetan texts, and the texts and intelligent ants are placed at random in a two-dimensional text vector space. Each intelligent ant then randomly selects a Tibetan text, calculates the similarity between this text and the others in the local area, and computes the probability of the pick-up or drop operation to determine whether to pick up, move, or drop the text. Finally, through iterative training, the Tibetan texts are clustered according to their similarities. Experimental results on real Tibetan Web text datasets show that the proposed algorithm is more accurate than the traditional k-means clustering algorithm, improving clustering accuracy by about 8.0% on average.
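The pick-up and drop decision described above can be sketched as follows. The abstract does not give SCAST's actual formulas, so this sketch substitutes the classical ant-clustering probabilities in the style of Deneubourg and of Lumer and Faieta; the function names and the parameters `alpha`, `k1`, and `k2` are assumptions for illustration only.

```python
def local_similarity(neighbor_sims, neighborhood_cells, alpha=0.5):
    """Average similarity f of one text to the texts in its local area.

    neighbor_sims: similarities in [0, 1] between the text and each
        neighboring text in the local area.
    neighborhood_cells: number of grid cells in the local area.
    alpha: dissimilarity scale (assumed value, not from the paper).
    """
    total = sum(1.0 - (1.0 - s) / alpha for s in neighbor_sims)
    return max(0.0, total / neighborhood_cells)


def pick_probability(f, k1=0.1):
    """Classical pick-up rule: high when the text fits its area poorly."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Classical drop rule: high when the text fits its area well."""
    return (f / (k2 + f)) ** 2


# A text surrounded by similar texts is unlikely to be picked up
# and likely to be dropped there.
f_dense = local_similarity([0.9, 0.95, 0.9, 0.85, 0.9, 0.9], 9)
# A text among dissimilar neighbors is likely to be picked up.
f_sparse = local_similarity([0.1, 0.2], 9)
```

In a full run, an unladen ant would compare `pick_probability(f)` with a random draw before picking a text up, and a laden ant would do the same with `drop_probability(f)` before dropping it, repeating over many iterations.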

Key words: Swarm Intelligence, Tibetan Text, Clustering Analysis, Swarm Similarity

Received: 2013-06-26

Chinese Library Classification (ZTFLH): TP311

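Funding:

Supported by the National Natural Science Foundation of China (No. 61165013, 61100045), the Youth Foundation for Humanities and Social Sciences Research of the Ministry of Education (No. 14YJCZH046), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20110184120008), the Special Project of the China Postdoctoral Science Foundation (No. 201104697), the Fundamental Research Funds for the Central Universities (No. 2682013BR023), and the Sichuan Province Science and Technology Innovation Seedling Project (No. 2012ZZ059)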

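About the authors: KANG Jian, male, born in 1986, master's degree; his research interests include swarm intelligence emergence and Tibetan information processing. E-mail: kangjian_0123@163.com. QIAO Shao-Jie, male, born in 1981, postdoctoral fellow, associate professor; his research interests include databases, swarm intelligence emergence, and mobile social networks. GESANG Duoji, male, born in 1972, master's degree, associate professor; his research interests include Tibetan information processing. HAN Nan (corresponding author), female, born in 1984, Ph.D., engineer; her research interests include databases and bioinformatics. E-mail: hannan@swtju.edu.cn. HONG Xi-Jin, male, born in 1957, professor, doctoral supervisor; his research interests include biostatistics, information security, and image processing. NIMA Zhaxi, male, born in 1972, associate professor; his research interests include Tibetan information processing. FAN Xiao-Gang, male, born in 1991, master's degree; his research interests include Tibetan information processing.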

Cite this article:

KANG Jian, QIAO Shao-Jie, GESANG Duoji, HAN Nan, HONG Xi-Jin, NIMA Zhaxi, FAN Xiao-Gang. A Semi-Structured Tibetan Text Clustering Algorithm Based on Swarm Intelligence<sup>*</sup>[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2014, 27(7): 663-672.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2014/V27/I7/663
