Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples

doi:10.16451/j.cnki.issn1003-6059.202411007


Abstract: Offline reinforcement learning algorithms rely heavily on the quality of the samples in the dataset. To address this issue, an offline reinforcement learning algorithm based on selection of high-quality samples (SHS) is proposed. In the policy evaluation stage, higher update weights are assigned to samples with high advantage values, and a policy entropy term is added so that high-quality, high-probability actions within the data distribution are identified quickly, thereby selecting more valuable action samples. In the policy optimization stage, SHS maximizes the normalized advantage function while maintaining a policy constraint on the actions in the dataset. Consequently, high-quality samples are utilized efficiently even when the overall sample quality of the dataset is low, improving the learning efficiency and performance of the policy. Experiments on the D4RL offline datasets in the MuJoCo-Gym environment show that SHS performs well and successfully selects more valuable samples, verifying its effectiveness.
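The abstract describes the two stages only in words. As a rough illustration, the sketch below shows one way advantage-dependent update weights and a normalized advantage could be computed. It is not the SHS update rule: the asymmetric weighting is borrowed from expectile-style regression as used in implicit Q-learning, the normalization by mean absolute advantage is an assumed choice, and the policy entropy term and policy constraint are omitted. All function names and the parameter `tau` are hypothetical.

```python
import numpy as np


def advantage_weights(q_values, v_values, tau=0.7):
    """Assumed weighting: samples with positive advantage get weight tau,
    all others get 1 - tau (expectile-style, not taken from the paper)."""
    advantage = q_values - v_values
    return np.where(advantage > 0, tau, 1.0 - tau)


def weighted_value_loss(q_values, v_values, tau=0.7):
    """Mean squared advantage, weighted so that high-advantage samples
    contribute more to the value update when tau > 0.5."""
    advantage = q_values - v_values
    weights = advantage_weights(q_values, v_values, tau)
    return float(np.mean(weights * advantage ** 2))


def normalized_advantage(q_values, v_values, eps=1e-8):
    """Advantage scaled by its mean absolute value over the batch
    (one possible normalization, assumed here for illustration)."""
    advantage = q_values - v_values
    return advantage / (np.mean(np.abs(advantage)) + eps)


# Toy batch: three samples with advantages -1, 0 and +1.
q = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 2.0])
print(advantage_weights(q, v))
print(weighted_value_loss(q, v))
print(normalized_advantage(q, v))
```

With `tau` above 0.5, the sample whose advantage is positive receives the largest weight, which mirrors the idea of emphasizing higher-quality samples during policy evaluation.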

Key words: Reinforcement Learning, Offline Reinforcement Learning, Distribution Shift, Policy Constraint, Value Function, Sample Selection

Received: 23 August 2024

ZTFLH (Chinese Library Classification): TP 18

Corresponding Author: REN Yi, Ph.D., senior engineer. His research interests include reinforcement learning and intelligent game.

About authors: HOU Yonghong, Ph.D., professor. His research interests include computer vision, video and image processing, and digital communication. DING Wang, Master's student. His research interests include offline reinforcement learning and artificial intelligence. DONG Hongwei, Ph.D., assistant professor. His research interests include machine learning and pattern recognition. YANG Songling, Master's student. His research interests include reinforcement learning and artificial intelligence.


Cite this article:

HOU Yonghong, DING Wang, REN Yi, et al. Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples[J]. Pattern Recognition and Artificial Intelligence, 2024, 37(11): 1022-1032.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202411007 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2024/V37/I11/1022
