Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning
WANG Xiaoli1, YE Dongyi1
1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108
2. Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou University, Fuzhou 350108
Abstract: The long-tail effect and the abundance of out-of-vocabulary (OOV) words in social media texts cause severe feature sparsity and reduce classification accuracy. To address this problem, a social media text classification method based on character-word feature self-attention learning is proposed. Global features are constructed at the character level to learn the attention weight distribution, and the existing multi-head attention mechanism is improved to reduce its parameter scale and computational complexity. To further analyze character-word feature fusion, an OOV sensitivity index is proposed to measure the impact of OOV words on different types of features. Experiments on several social media text classification tasks show that fusing word-level and character-level features with the proposed method yields a clear improvement in classification accuracy. Moreover, the quantitative results of the OOV sensitivity index verify the feasibility and effectiveness of the proposed method.
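The abstract states that the multi-head attention mechanism is modified to reduce parameter scale, but does not spell out the modification. As one illustrative way such a reduction can be achieved, the sketch below shares a single key/value projection across all heads instead of giving each head its own, so only the query projection keeps full width. This is a generic parameter-saving variant, not necessarily the paper's mechanism; all names and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(X, Wq, Wk, Wv, n_heads):
    """Multi-head self-attention with one K/V projection shared by all
    heads. Wq: (d, d); Wk, Wv: (d, d_head) -- versus 3 full (d, d)
    matrices in standard multi-head attention, cutting parameters."""
    T, d = X.shape
    dh = d // n_heads
    Q = X @ Wq            # (T, d), sliced into per-head queries below
    K = X @ Wk            # (T, dh), shared across heads
    V = X @ Wv            # (T, dh), shared across heads
    outs = []
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]             # (T, dh)
        A = softmax(q @ K.T / np.sqrt(dh))        # (T, T) attention weights
        outs.append(A @ V)                        # (T, dh)
    return np.concatenate(outs, axis=1)           # (T, d)

# Toy usage: 5 character positions, embedding width 8, 2 heads.
rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
X = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d // H))
Wv = rng.standard_normal((d, d // H))
Y = shared_kv_attention(X, Wq, Wk, Wv, H)
```

Here the projection parameters total d*d + 2*d*(d/H) instead of 3*d*d, and the saving grows with the number of heads.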
WANG Xiaoli, YE Dongyi. Social Media Text Classification Method Based on Character-Word Feature Self-attention Learning. Pattern Recognition and Artificial Intelligence, 2020, 33(4): 287-294.