Mongolian Word Segmentation Based on Statistical Language Model
HOU Hong-Xu1,2,3, LIU Qun1, Nasanurtu2, Murengaowa2, LI Jin-Tao1
1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. School of Computer Science, Inner Mongolia University, Huhhot 010021; 3. Graduate University of Chinese Academy of Sciences, Beijing 100190
Abstract: Based on an analysis of existing Mongolian segmentation techniques and of the rules that underlie word segmentation, a hybrid word segmentation method is proposed. The method uses a Mongolian statistical language model to resolve ambiguity in word segmentation. A part-of-speech (POS) language model and a Skip-N language model are used, and an experimental system is built on them. In the experiments, this system outperforms a purely rule-based system.
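The general idea of using a statistical language model to choose among ambiguous segmentations can be sketched as follows. This is a minimal illustration, not the system described in the paper: it assumes a plain bigram model with add-one smoothing in place of the POS and Skip-N models, a toy corpus of Latin-script placeholder morphemes, and a hand-written candidate list standing in for the output of the segmentation rules. All names (`train_bigram`, `log_prob`, `pick_segmentation`) are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sequence boundary markers


def train_bigram(corpus):
    """Count bigrams and their left contexts over segmented training sequences."""
    bigrams, contexts, vocab = Counter(), Counter(), {BOS, EOS}
    for seq in corpus:
        tokens = [BOS] + seq + [EOS]
        vocab.update(seq)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def log_prob(seq, model):
    """Log-probability of a morpheme sequence under add-one smoothed bigrams."""
    bigrams, contexts, vocab_size = model
    tokens = [BOS] + seq + [EOS]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def pick_segmentation(candidates, model):
    """Return the candidate segmentation the language model scores highest."""
    return max(candidates, key=lambda seq: log_prob(seq, model))


# Toy training data (assumption): stems followed by suffixes.
corpus = [
    ["nom", "-un"],
    ["nom", "-i"],
    ["ger", "-un"],
    ["ger", "-tu"],
]
model = train_bigram(corpus)

# Hypothetical candidates that a rule component might propose for one word.
candidates = [["nomun"], ["nom", "-un"], ["no", "-mun"]]
best = pick_segmentation(candidates, model)
print(best)
```

In this sketch the rules only propose candidates, and the language model alone decides among them. Shorter candidates have fewer probability factors, so a real system would likely need better smoothing and some form of length normalization or weighting.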
侯宏旭,刘群,那顺乌日图,牧仁高娃,李锦涛. 基于统计语言模型的蒙古文词切分*[J]. 模式识别与人工智能, 2009, 22(1): 108-112.
HOU Hong-Xu, LIU Qun, Nasanurtu, Murengaowa, LI Jin-Tao. Mongolian Word Segmentation Based on Statistical Language Model. Pattern Recognition and Artificial Intelligence, 2009, 22(1): 108-112.