n-grams Features Weighting Algorithm Based on Relevance and Semantic

doi:10.16451/j.cnki.issn1003-6059.201511005

Abstract: Using n-grams as text classification features can lower classification accuracy, because the redundancy and relevance between words are ignored when the n-grams are weighted. To address this, an n-grams feature weighting algorithm based on relevance and semantics is proposed. To reduce internal redundancy, feature reduction is applied to the n-grams during text preprocessing. The n-grams are then weighted according to two factors: the relevance between the words in each n-gram and the classes, and the semantic similarity between the n-grams and the testing dataset. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that the proposed algorithm selects n-grams features with high relevance and low redundancy, and reduces data sparsity when quantifying the testing dataset.
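The abstract does not give the weighting formula, so the following is only a minimal sketch of the "relevance of words and classes" idea, not the authors' method. It assumes relevance is measured as the mutual information between a word's presence in a document and the class label (the quantity that mRMR is built on), and it averages that score over the words of an n-gram. The function names `mutual_information` and `ngram_relevance_weight` are hypothetical, and the semantic similarity term is omitted.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """Mutual information (in bits) between the presence of `word`
    in a document and the document's class label.

    `docs` is a list of token sets, `labels` the matching class labels.
    """
    n = len(docs)
    present = [word in doc for doc in docs]
    joint = Counter(zip(present, labels))   # counts of (presence, label)
    p_word = Counter(present)               # counts of presence / absence
    p_label = Counter(labels)               # counts of each class

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((p_word[x] / n) * (p_label[y] / n)))
    return mi


def ngram_relevance_weight(docs, labels, ngram):
    """Assumed weighting: mean word-class mutual information over the
    words of the n-gram. This is an illustration, not the paper's formula."""
    return sum(mutual_information(docs, labels, w) for w in ngram) / len(ngram)


# Toy corpus: two classes, two documents each (hypothetical data).
docs = [{"stock", "market"}, {"stock", "price"},
        {"match", "goal"}, {"goal", "team"}]
labels = ["finance", "finance", "sports", "sports"]

print(ngram_relevance_weight(docs, labels, ("stock", "market")))
print(ngram_relevance_weight(docs, labels, ("stock", "goal")))
```

In this toy corpus, "stock" and "goal" each identify a class perfectly, so an n-gram made of them scores higher than one containing "market", which appears in only one document.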

Key words: Maximum Relevance Minimum Redundancy (mRMR); Semantic Similarity; n-grams; Feature Weighting

Received: 30 April 2014

ZTFLH: TP 391.1

Authors: QIU Yun-Fei, LIU Shi-Xing, LIN Ming-Ming, SHAO Liang-Shan

Cite this article:

QIU Yun-Fei, LIU Shi-Xing, LIN Ming-Ming, et al. n-grams Features Weighting Algorithm Based on Relevance and Semantic[J]. , 2015, 28(11): 992-1001.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.201511005 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2015/V28/I11/992
