Abstract: The commonly used Uyghur segmentation method produces a large number of semantically abstract and even polysemous word features, which makes it difficult for learning algorithms to find the hidden structure in the high-dimensional data. A segmentation approach, dme-TS, and a feature selection approach, UMRMR-UFS, both based on an unsupervised strategy, are proposed. In dme-TS, word-based bigrams and contextual information are derived automatically from a large-scale raw text corpus, and a linear combination of the difference of t-test, mutual information, and double-word adjacency entropy is taken as a measurement (dme) to estimate the agglutinative strength between two adjacent Uyghur words. In UMRMR-UFS, an improved unsupervised feature selection criterion (UMRMR) is proposed, and the importance of each feature is estimated according to its minimum redundancy and maximum relevance. The experimental results show that dme-TS effectively reduces the dimensionality of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its highest performance on the feature subset selected by UMRMR-UFS.
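As a rough illustration of one ingredient of dme, the sketch below computes pointwise mutual information for adjacent word pairs from raw bigram counts. It is only a minimal sketch under stated assumptions: the abstract does not give the exact formulas, the weights of the linear combination, or the other two components (difference of t-test and adjacency entropy), so none of those are reproduced here. The function name `bigram_pmi` and the toy corpus of placeholder tokens are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Assumption: probabilities are plain relative frequencies; the
    paper's own estimator may differ.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        # Pair each word with its right-hand neighbour.
        bigrams.update(zip(words, words[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus: placeholder tokens, not real Uyghur words.
corpus = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["c", "d", "a", "b"],
]
scores = bigram_pmi(corpus)
```

A pair that co-occurs more often than its word frequencies predict receives a higher score, which is the intuition behind treating such pairs as candidates for a single semantic unit. In this toy corpus the pair `("a", "b")` scores higher than `("b", "c")`.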