A Text Clustering Method Based on Speech to Text and Improved Center Selection
SHI Kan-Sheng, LIU Hai-Tao, SONG Wen-Tao
School of Electronic Information and Electrical Engineering, Shanghai Jiaotong University, Shanghai 200040
Abstract The traditional k-means algorithm is sensitive to the choice of initial centers and easily falls into a local optimum. A text clustering method based on speech to text and improved center selection (STICS) is proposed. STICS considers the speech to text, the selection of centers, and the treatment of outliers together, and improves on k-means in these three respects. First, text is represented by a weighted vector space model (VSM) according to the speech to text. Second, for center selection, the average similarity of each sample is measured, and the sample with the largest average similarity is selected as the first center. Third, STICS eliminates the negative influence of isolated points and outliers. Both theoretical analysis and experimental results show that the proposed method yields better clustering results.
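The first-center rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the similarity measure or the exact averaging, so the sketch assumes cosine similarity between VSM vectors and averages each sample's similarity over all other samples. The function names (`cosine`, `average_similarities`, `first_center`) are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def average_similarities(vectors):
    """Average similarity of each sample to every other sample."""
    n = len(vectors)
    if n < 2:
        return [0.0] * n
    return [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def first_center(vectors):
    """Index of the sample with the largest average similarity."""
    if not vectors:
        raise ValueError("at least one sample is required")
    avg = average_similarities(vectors)
    return max(range(len(vectors)), key=avg.__getitem__)


# Toy term-weight vectors; the values are illustrative only.
docs = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
center_index = first_center(docs)
```

In this toy data the second vector shares terms with two other documents, so it has the largest average similarity, while the last vector shares terms with none and has an average similarity of zero. A low average similarity of this kind is one plausible signal for an isolated point, although the abstract does not state how STICS detects outliers.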
Received: 25 August 2011