Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples
HOU Yonghong1, DING Wang1, REN Yi2, DONG Hongwei2, YANG Songling1
1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072; 2. National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing 100190
Abstract: To address the over-reliance of offline reinforcement learning algorithms on the quality of dataset samples, an offline reinforcement learning algorithm based on selection of high-quality samples (SHS) is proposed. In the policy evaluation stage, higher update weights are assigned to samples with higher advantage values, and a policy entropy term is added to quickly identify high-quality, high-probability action samples within the data distribution, thereby selecting more valuable action samples. In the policy optimization stage, SHS maximizes the normalized advantage function while maintaining the policy constraint on actions within the dataset. Consequently, high-quality samples can be exploited efficiently even when the overall sample quality of the dataset is low, improving the learning efficiency and performance of the policy. Experiments show that SHS performs well on the D4RL offline datasets in the MuJoCo-Gym environment and successfully selects more valuable samples, verifying its effectiveness.
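The two stages described in the abstract can be made concrete with a minimal sketch. The code below is an illustrative assumption, not the paper's implementation: the clipped exponential advantage weighting, the soft (entropy-regularized) Bellman target, the advantage normalization, and the hyperparameter names gamma, alpha, w_max and bc_coef are all hypothetical choices used only to show the shape of the two update rules.

```python
# Minimal illustrative sketch of the two SHS stages (assumptions: PyTorch,
# an actor(s) returning a torch.distributions.Normal over actions, critics
# returning tensors of shape [batch, 1], rewards/done flags of shape
# [batch, 1]; all weighting forms and hyperparameters are hypothetical).
import torch


def policy_evaluation_loss(critic, target_critic, actor, batch,
                           gamma=0.99, alpha=0.1, w_max=20.0):
    """Advantage-weighted TD loss with an entropy term in the target:
    dataset actions with higher advantage receive larger update weights,
    and the entropy term favors high-probability in-distribution actions."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        next_dist = actor(s2)
        a2 = next_dist.sample()
        log_p2 = next_dist.log_prob(a2).sum(-1, keepdim=True)
        # Soft Bellman target including a policy entropy term (one possible
        # reading of the entropy term mentioned in the abstract).
        target_q = r + gamma * (1.0 - done) * (target_critic(s2, a2) - alpha * log_p2)

        # Advantage of the dataset action over the current policy's action.
        cur_a = actor(s).sample()
        adv = target_critic(s, a) - target_critic(s, cur_a)
        # Larger update weight for higher-advantage samples; the clipped
        # exponential form is an assumed choice.
        w = torch.clamp(torch.exp(adv), max=w_max)

    td_error = critic(s, a) - target_q
    return (w * td_error.pow(2)).mean()


def policy_improvement_loss(critic, actor, batch, bc_coef=2.5):
    """Maximize a normalized advantage while constraining the policy to
    actions that appear in the dataset (behavior-cloning term)."""
    s, a, _, _, _ = batch
    dist = actor(s)
    pi_a = dist.rsample()
    adv = critic(s, pi_a) - critic(s, a).detach()
    # Normalize so the strength of the dataset constraint is scale-invariant
    # (the exact normalization is an assumption).
    adv = adv / (adv.abs().mean().detach() + 1e-6)
    bc_loss = -dist.log_prob(a).sum(-1).mean()
    return -adv.mean() + bc_coef * bc_loss
```

In practice the critic and actor would be updated by minimizing these two losses alternately on minibatches drawn from the offline dataset; the weighting in the first loss is what concentrates the updates on the higher-quality samples.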