An Overview of Natural Language Processing for Indonesian and Malay
JIANG Shengyi1,2, LI Shanshan1,2, FU Sihui1, LIN Nankai1,2
1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006
2. Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510006
Abstract: As Indonesian and Malay become more widely used, processing the massive volume of text in these two languages is increasingly important. A considerable body of research exists on Indonesian and Malay. However, as low-resource languages, they draw less attention than widely used languages, so deep learning methods cannot be fully exploited. This paper analyzes and summarizes research on Indonesian and Malay morphological analysis, syntactic parsing, machine translation, spelling checking, and other tasks. In most of the studies surveyed, algorithms cannot be compared objectively because they use different corpus scales and evaluation metrics. Finally, taking into account the open language resources available in various fields, the paper discusses open problems and future directions for natural language processing of Indonesian and Malay.
JIANG Shengyi, LI Shanshan, FU Sihui, LIN Nankai. An Overview of Natural Language Processing for Indonesian and Malay[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(6): 530-541.