Abstract: When n-grams are used as text classification features, classification accuracy decreases because the redundancy and relevance among words are ignored during n-gram weighting. Therefore, an n-gram feature weighting algorithm based on relevance and semantics is proposed. To reduce internal redundancy, feature reduction is applied to the n-grams during text preprocessing. The n-grams are then weighted according to the relevance between the words and classes within each n-gram and the semantic similarity between the n-grams and the test dataset. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that the proposed algorithm selects n-gram features with high relevance and low redundancy and alleviates data sparseness when quantifying the test dataset.
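The abstract does not give the paper's exact weighting formula, but the core idea (score each n-gram by how relevant its constituent words are to a class) can be sketched as follows. This is a minimal illustration, not the authors' method: pointwise mutual information is used here as a stand-in for the word-class relevance term, and all function names and the toy corpus are hypothetical.

```python
from math import log


def extract_ngrams(tokens, n=2):
    """Return all contiguous n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_class_relevance(word, cls, docs):
    """Pointwise mutual information between a word and a class label,
    used here as an illustrative relevance measure.
    docs: list of (token_list, class_label) pairs."""
    n = len(docs)
    n_w = sum(1 for tokens, _ in docs if word in tokens)
    n_c = sum(1 for _, c in docs if c == cls)
    n_wc = sum(1 for tokens, c in docs if word in tokens and c == cls)
    if n_w == 0 or n_c == 0 or n_wc == 0:
        return 0.0
    return log(n * n_wc / (n_w * n_c))


def ngram_weight(ngram, cls, docs):
    """Weight an n-gram by the mean relevance of its words to a class."""
    return sum(word_class_relevance(w, cls, docs) for w in ngram) / len(ngram)


# Toy corpus: two labeled documents.
docs = [(["stock", "market", "rises"], "finance"),
        (["team", "wins", "match"], "sports")]

bigrams = extract_ngrams(["stock", "market", "rises"])  # [("stock","market"), ("market","rises")]
w = ngram_weight(("stock", "market"), "finance", docs)   # positive: both words co-occur with "finance"
```

An n-gram whose words never appear in a class scores 0 under this sketch, so class-irrelevant n-grams are naturally down-weighted. The paper additionally incorporates semantic similarity (via HowNet, per reference [18]) between n-grams and the test dataset, which this sketch omits.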
QIU Yun-Fei, LIU Shi-Xing, LIN Ming-Ming, SHAO Liang-Shan. n-grams Features Weighting Algorithm Based on Relevance and Semantics [J]. Pattern Recognition and Artificial Intelligence, 2015, 28(11): 992-1001.