Siamese Contrastive Network-Based Multilingual Parallel Sentence Pair Extraction between Chinese and Southeast Asian Languages
ZHOU Yuanzhuo1,2, MAO Cunli1,2, SHEN Zheng1,2, ZHANG Siqi1,2, YU Zhengtao1,2, WANG Zhenhan1,2
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; 2. Key Laboratory of Artificial Intelligence in Yunnan Province, Kunming University of Science and Technology, Kunming 650504, China
Abstract: Parallel sentence pair extraction performs poorly on low-resource Southeast Asian languages, primarily because the scarcity of training corpora weakens the representation capability of extraction models. Therefore, a siamese contrastive network-based method for multilingual parallel sentence pair extraction between Chinese and Southeast Asian languages is proposed, optimizing the model structure, the training strategy and the training data. Firstly, a siamese contrastive network framework is employed, integrating the idea of contrastive learning into the siamese network to strengthen the representation of parallel sentence pairs. Secondly, a strategy of joint training with similar languages is introduced to share knowledge across languages and improve the learning ability of the model. Finally, Chinese-mixed Southeast Asian parallel sentence pairs are constructed via multilingual word replacement, providing rich sample information for training. Experiments on Chinese-Thai and Chinese-Lao datasets demonstrate that the proposed method effectively improves parallel sentence pair extraction performance.