A Text Clustering Method Based on Part of Speech and Improved Center Selection

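Abstract (translated from the Chinese): To address the sensitivity of the k-means algorithm to its initial points and its tendency to fall into local optima, a text clustering method based on part of speech and improved center selection (STICS) is proposed. By improving the semantic representation of text, optimizing the selection of cluster centers, and eliminating the negative influence of isolated points, the method obtains better clustering results. STICS takes into account the contribution of different part-of-speech features to a text and represents each text with a weighted vector space model. For center selection, it first measures the sample average similarity of every sample and then selects the sample with the largest sample average similarity as the first cluster center. In addition, STICS eliminates the negative influence of isolated points, which further improves clustering. Experimental results show that the proposed method achieves better clustering results.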

Keywords: text clustering, k-means, part-of-speech features, sample average similarity, isolated points

Abstract: The traditional k-means algorithm is sensitive to the choice of initial points and easily falls into local optima. A text clustering method based on part of speech and improved center selection (STICS) is proposed. STICS makes three improvements: it refines the text representation using part-of-speech information, optimizes the selection of cluster centers, and handles outliers. Text is represented with a vector space model (VSM) whose terms are weighted according to their part of speech. For center selection, the sample average similarity is computed for each sample, and the sample with the largest sample average similarity is selected as the first center. In addition, STICS eliminates the negative influence of isolated points (outliers). Both theoretical analysis and experimental results indicate that the proposed algorithm achieves better clustering results.
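The three ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the part-of-speech weights (`POS_WEIGHTS`, `DEFAULT_WEIGHT`), the use of cosine similarity, and the threshold rule for flagging isolated points (`outlier_threshold`) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights; the abstract does not state the values.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.6, "adj": 0.3}
DEFAULT_WEIGHT = 0.1  # assumed weight for any other part of speech


def weighted_vector(tagged_tokens):
    """Build a sparse term vector from (term, pos) pairs, scaling each
    occurrence by the weight assigned to its part of speech."""
    vec = Counter()
    for term, pos in tagged_tokens:
        vec[term] += POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
    return dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_average_similarity(vectors):
    """Mean similarity of each sample to all other samples."""
    n = len(vectors)
    if n < 2:
        return [0.0] * n
    return [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def first_center_and_outliers(vectors, outlier_threshold=0.05):
    """Pick the sample with the largest average similarity as the first
    center, and flag samples whose average similarity falls below the
    (assumed) threshold as isolated points."""
    avg = sample_average_similarity(vectors)
    first_center = max(range(len(vectors)), key=avg.__getitem__)
    outliers = [i for i, s in enumerate(avg) if s < outlier_threshold]
    return first_center, outliers


# Toy corpus: three related documents and one unrelated document.
docs = [
    [("cluster", "noun"), ("text", "noun"), ("group", "verb")],
    [("cluster", "noun"), ("text", "noun"), ("fast", "adj")],
    [("text", "noun"), ("cluster", "noun"), ("compare", "verb")],
    [("banana", "noun"), ("ripe", "adj")],
]
vectors = [weighted_vector(d) for d in docs]
first_center, outliers = first_center_and_outliers(vectors)
# first_center is 1 (the second document); outliers is [3]
```

In this toy corpus the second document is the most similar to the others on average, so it becomes the first center, while the unrelated fourth document shares no terms with the rest and is flagged as isolated. How the remaining centers are chosen is not described in the abstract and is left out of the sketch.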

Key words: Text Clustering, k-means, Part of Speech, Sample Average Similarity, Outlier

Received: 2011-08-25

CLC number: TP3

Funding: Supported by the National Natural Science Foundation of China (No. 60970107)

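About the authors: SHI Kan-Sheng, male, born in 1966, Ph.D., professor. His main research interests are cloud computing and intelligent mining. E-mail: steve@joinvc.com. LIU Hai-Tao, male, born in 1974, Ph.D., associate professor. His main research interests are massive data processing and the Internet of Things. SONG Wen-Tao, male, born in 1936, professor and doctoral supervisor. His main research interests are network communication and massive data processing.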

Cite this article:

SHI Kan-Sheng, LIU Hai-Tao, SONG Wen-Tao. A Text Clustering Method Based on Speech to Text and Improved Center Selection [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2012, 25(6): 996-1001.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/ or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2012/V25/I6/996
