Abstract: On highly imbalanced data, the bootstrap sampling used by the traditional cost-sensitive random forest algorithm leads to insufficient learning of minority-class samples, and the large proportion of majority-class samples easily weakens the algorithm's cost-sensitive mechanism. Therefore, a clustering-based weakly balanced cost-sensitive random forest algorithm is proposed. The majority-class samples are first clustered, and samples are then repeatedly drawn from each cluster under a weak balance criterion. The selected majority-class samples are fused with the minority-class samples of the original training set to generate several new imbalanced datasets for training cost-sensitive decision trees. The proposed algorithm not only allows the minority-class samples to be fully learned, but also ensures that the reduction of majority-class samples leaves the cost-sensitive mechanism largely unaffected. Experiments indicate that the proposed algorithm performs better on highly imbalanced datasets.
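The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the proportional per-cluster draw, and the `ratio` parameter standing in for the weak balance criterion are all assumptions, and cost sensitivity is approximated here with scikit-learn's `class_weight` on each decision tree.

```python
# Hypothetical sketch of a clustering-based weakly balanced
# cost-sensitive random forest (names and parameters assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def train_wbcs_forest(X_maj, X_min, n_trees=10, n_clusters=5, ratio=3.0, seed=0):
    """ratio: majority-to-minority size after reduction, standing in for
    the paper's weak balance criterion (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster the majority class so every region is represented.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    clusters = [X_maj[km.labels_ == c] for c in range(n_clusters)]
    target = int(ratio * len(X_min))  # majority samples kept per tree
    trees = []
    for _ in range(n_trees):
        # Step 2: repeatedly draw from each cluster, proportional to its size.
        parts = []
        for c in clusters:
            k = max(1, int(round(target * len(c) / len(X_maj))))
            idx = rng.choice(len(c), size=min(k, len(c)), replace=False)
            parts.append(c[idx])
        # Step 3: fuse the reduced majority samples with all minority samples.
        n_maj = sum(len(p) for p in parts)
        X_sub = np.vstack(parts + [X_min])
        y_sub = np.hstack([np.zeros(n_maj), np.ones(len(X_min))]).astype(int)
        # Step 4: cost sensitivity via a higher misclassification weight
        # on the minority class (class 1).
        t = DecisionTreeClassifier(class_weight={0: 1.0, 1: ratio},
                                   random_state=seed)
        trees.append(t.fit(X_sub, y_sub))
    return trees

def predict(trees, X):
    # Majority vote over the ensemble of cost-sensitive trees.
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

Because each tree sees every cluster of the majority class plus the full minority set, the minority class is learned by every base classifier while the class ratio stays mildly imbalanced, so the per-tree cost weights still have an effect.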