Abstract: The optimal structure of the topic set can be learned automatically from data with the Hierarchical Dirichlet Process (HDP) topic model. However, the resulting topic set may not meet semantic requirements, and in some labeled topic models the parameters are difficult to set. Therefore, based on known semantic labels and their certainty degrees, a semi-supervised labeled HDP topic model (SLHDP) and an accuracy evaluation index for random clustering are proposed in this paper. Higher weight is given to the known semantic labels, and, exploiting the Dirichlet process property that a finite space can be partitioned infinitely, the model is constructed via the Chinese restaurant process. Experimental results on several Chinese and English datasets show that the SLHDP model yields a more reasonable topic set for text classification on large-scale datasets.
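To illustrate the construction the abstract refers to, the following is a minimal sketch of the basic Chinese restaurant process seating rule that underlies HDP-style models, not the paper's full SLHDP sampler; the function name `crp_sample_tables` and the choice of concentration parameter are illustrative assumptions.

```python
import numpy as np

def crp_sample_tables(n_customers, alpha, rng=None):
    """Sample a table assignment for each customer under a Chinese
    restaurant process with concentration parameter alpha.

    Illustrative sketch only: in an HDP topic model, tables correspond
    to topics, and the number of distinct tables grows with the data
    rather than being fixed in advance.
    """
    rng = np.random.default_rng(rng)
    tables = []   # tables[i] = table index assigned to customer i
    counts = []   # counts[k] = number of customers seated at table k
    for _ in range(n_customers):
        # Sit at an existing table with probability proportional to its
        # occupancy, or open a new table with probability proportional
        # to alpha (the "rich get richer" dynamic).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # a new table (topic) is created
        else:
            counts[k] += 1
        tables.append(k)
    return tables

# Example: seat 100 customers with alpha = 1.0.
assignments = crp_sample_tables(100, alpha=1.0, rng=0)
print(f"{max(assignments) + 1} tables used")
```

In a labeled variant such as SLHDP, the seating probabilities would additionally be reweighted by the known semantic labels and their certainty degrees, so that labeled documents favor tables consistent with their labels.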