Siamese Contrastive Network-Based Multilingual Parallel Sentence Pair Extraction between Chinese and Southeast Asian Languages
ZHOU Yuanzhuo1,2, MAO Cunli1,2, SHEN Zheng1,2, ZHANG Siqi1,2, YU Zhengtao1,2, WANG Zhenhan1,2
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; 2. Key Laboratory of Artificial Intelligence in Yunnan Province, Kunming University of Science and Technology, Kunming 650504, China
Abstract: Parallel sentence pair extraction performs poorly on low-resource Southeast Asian languages, primarily because the scarcity of training corpora weakens the representation capability of extraction models. Therefore, a siamese contrastive network-based method for multilingual parallel sentence pair extraction between Chinese and Southeast Asian languages is proposed, optimizing the model structure, the training strategy and the training data. Firstly, a siamese contrastive network framework is employed, integrating the idea of contrastive learning into the siamese network to strengthen the representation of parallel sentence pairs. Secondly, a strategy of joint training with similar languages is introduced to share knowledge across languages and improve the learning ability of the model. Finally, Chinese-mixed Southeast Asian parallel sentence pairs are constructed via multilingual word replacement, providing rich sample information for training. Experiments on Chinese-Thai and Chinese-Lao datasets demonstrate that the proposed method effectively improves parallel sentence pair extraction performance.