A Semi-Structured Tibetan Text Clustering Algorithm Based on Swarm Intelligence
KANG Jian1, QIAO Shao-Jie1, GESANG Duoji2, HAN Nan3, HONG Xi-Jin1, NIMA Zhaxi2, FAN Xiao-Gang1
1School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031
2College of Engineering, Tibet University, Lhasa 850000
3School of Life Science and Engineering, Southwest Jiaotong University, Chengdu 610031
Abstract To apply swarm intelligence techniques to the clustering of semi-structured Tibetan Web texts, a semi-structured Tibetan text clustering algorithm based on swarm intelligence (SCAST) is proposed. To balance the accuracy and efficiency of Tibetan text clustering, a vector space model is used to represent Tibetan texts, and the Tibetan texts and intelligent ants are placed at random in a two-dimensional text vector space. Each intelligent ant then randomly selects a Tibetan text, calculates the similarity between this text and the other texts in its local area, and computes the probability of a pick-up or drop operation to determine whether to pick up, move, or drop the text. Finally, the Tibetan texts are accurately clustered according to their similarities through iterative training of the proposed algorithm. Experimental results on real Tibetan Web text datasets show that the proposed algorithm is more accurate than the traditional k-means clustering algorithm, with an average accuracy increase of 8.0%.
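The pick-up and drop loop described in the abstract can be illustrated with a minimal Python sketch. It is not the SCAST implementation: the abstract does not give the probability formulas, so the sketch assumes the squared-ratio response functions commonly used in ant-based clustering, with hypothetical constants k1 and k2. It also assumes that the texts are already segmented into tokens, that similarity is the cosine of raw term-frequency vectors, and that a picked-up text is put back in place if the ant declines to drop it at the new location. All function names are illustrative.

```python
import math
import random
from collections import Counter


def tf_vector(tokens):
    """Vector space model: term-frequency vector of one segmented text."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_similarity(idx, vectors, positions, radius):
    """Mean similarity between text idx and the texts within `radius` cells."""
    x, y = positions[idx]
    sims = [
        cosine_similarity(vectors[idx], vectors[j])
        for j, (px, py) in enumerate(positions)
        if j != idx and max(abs(px - x), abs(py - y)) <= radius
    ]
    return sum(sims) / len(sims) if sims else 0.0


def pick_probability(f, k1=0.1):
    """Assumed response function: a poorly matched text is likely picked up."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Assumed response function: a well matched text is likely dropped."""
    return (f / (k2 + f)) ** 2


def ant_step(vectors, positions, grid, radius, rng):
    """One ant action: select a text, maybe pick it up, move it, maybe drop it."""
    idx = rng.randrange(len(vectors))
    f_here = local_similarity(idx, vectors, positions, radius)
    if rng.random() >= pick_probability(f_here):
        return  # the text fits its neighbourhood; leave it in place
    old_pos = positions[idx]
    positions[idx] = (rng.randrange(grid), rng.randrange(grid))
    f_there = local_similarity(idx, vectors, positions, radius)
    if rng.random() >= drop_probability(f_there):
        positions[idx] = old_pos  # simplification: put the text back


def demo(steps=500, grid=10, radius=2, seed=0):
    """Run the loop on four toy texts (placeholder tokens, not real data)."""
    rng = random.Random(seed)
    texts = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"], ["x", "y", "w"]]
    vectors = [tf_vector(t) for t in texts]
    positions = [(rng.randrange(grid), rng.randrange(grid)) for _ in texts]
    for _ in range(steps):
        ant_step(vectors, positions, grid, radius, rng)
    return positions
```

Calling `demo()` returns the final grid coordinates of the four toy texts. Over enough iterations, texts that share terms tend to end up near one another, and clusters can then be read off the grid by grouping neighbouring positions.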
Received: 26 June 2013