Unsupervised Uyghur Segmentation and Unsupervised Feature Selection


Abstract: Commonly used Uyghur segmentation methods produce a large number of semantically abstract and even polysemous word features, which makes it difficult for learning algorithms to find the hidden structure in the high-dimensional data. A segmentation approach, dme-TS, and a feature selection approach, UMRMR-UFS, both based on an unsupervised strategy, are proposed. In dme-TS, word-based bigrams and contextual information are derived automatically from a large-scale raw text corpus, and a linear combination of the difference of t-test, mutual information, and the entropy of double-word adjacency is taken as a measurement (dme) to estimate the agglutinative strength between two adjacent Uyghur words. In UMRMR-UFS, an improved unsupervised feature selection criterion (UMRMR) is proposed, and the importance of each feature is estimated according to its minimum redundancy and maximum relevance. The experimental results show that dme-TS effectively reduces the dimensionality of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its highest performance on the feature subset selected by UMRMR-UFS.
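The abstract names the three statistics combined in dme but gives no formulas, so the following is only a minimal sketch of how such a linear combination might be computed from raw bigram counts. Everything specific here is an assumption rather than the authors' definition: the function names, the equal default weights, the lack of normalization, the choice of the smaller of the left and right neighbour entropies, and the variance estimate used in the t-score. The toy corpus uses English placeholder tokens, not Uyghur data.

```python
import math
from collections import Counter, defaultdict

def build_stats(sentences):
    """Count unigrams, bigrams, and the left/right neighbours of each bigram."""
    uni, bi = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for words in sentences:
        uni.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bi[pair] += 1
            if i > 0:
                left[pair][words[i - 1]] += 1
            if i + 2 < len(words):
                right[pair][words[i + 2]] += 1
    return uni, bi, left, right

def pmi(x, y, uni, bi):
    """Pointwise mutual information of the adjacent pair (x, y)."""
    if bi[(x, y)] == 0:
        return 0.0  # assumption: unseen pairs get a neutral score
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    p_xy = bi[(x, y)] / n_bi
    return math.log2(p_xy / ((uni[x] / n_uni) * (uni[y] / n_uni)))

def entropy(counts):
    """Shannon entropy (bits) of a Counter of neighbour counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def adjacency_entropy(x, y, left, right):
    """Assumed form: the smaller of the left and right neighbour entropies."""
    return min(entropy(left[(x, y)]), entropy(right[(x, y)]))

def t_score(a, b, c, uni, bi):
    """t-score of b in context a b c: positive if b binds more to c than to a."""
    def cond(u, w):
        # Returns p(w | u) and a simple variance estimate count / count(u)^2.
        if uni[u] == 0:
            return 0.0, 0.0
        return bi[(u, w)] / uni[u], bi[(u, w)] / uni[u] ** 2
    p_next, var_next = cond(b, c)
    p_prev, var_prev = cond(a, b)
    denom = math.sqrt(var_next + var_prev)
    return (p_next - p_prev) / denom if denom > 0 else 0.0

def dme(v, x, y, w, stats, weights=(1.0, 1.0, 1.0)):
    """Linear combination scoring the gap between x and y in context v x y w."""
    uni, bi, left, right = stats
    delta_t = t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)
    a, b, c = weights  # hypothetical weights, not taken from the paper
    return (a * delta_t + b * pmi(x, y, uni, bi)
            + c * adjacency_entropy(x, y, left, right))

# Toy corpus with placeholder tokens.
TOY = [["new", "york", "is", "big"],
       ["new", "york", "was", "small"],
       ["he", "is", "in", "new", "york"]]
```

Under these assumptions, a higher score suggests that the two adjacent words should be joined into one unit, for example `dme(None, "new", "york", "is", build_stats(TOY))`, where `None` marks a missing left context at the start of a sentence. In practice the three terms would likely need to be normalized to comparable scales before weighting.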

Key words: Uyghur Segmentation; Mutual Information; Difference of t-Test; Entropy of Adjacency; Unsupervised Feature Selection

Received: 14 August 2012

CLC number (ZTFLH): TP391


Cite this article:

TOHTI Turdi,PATTA Akbarr,HAMDULLA Askar. Unsupervised Uyghur Segmentation and Unsupervised Feature Selection[J]. , 2013, 26(9): 845-852.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/ OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2013/V26/I9/845
