Offline Reinforcement Learning Algorithm Based on Selection of High-Quality Samples
HOU Yonghong1, DING Wang1, REN Yi2, DONG Hongwei2, YANG Songling1
1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072; 2. National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing 100190
Abstract: To address the over-reliance of offline reinforcement learning algorithms on the quality of dataset samples, an offline reinforcement learning algorithm based on selection of high-quality samples (SHS) is proposed. In the policy evaluation stage, higher update weights are assigned to samples with higher advantage values, and a policy entropy term is added to quickly identify high-quality, high-probability action samples within the data distribution, thereby selecting more valuable action samples. In the policy optimization stage, SHS maximizes the normalized advantage function while maintaining the policy constraint on actions within the dataset. Consequently, high-quality samples can be exploited efficiently even when the overall sample quality of the dataset is low, improving the learning efficiency and performance of the policy. Experiments show that SHS performs well on the D4RL offline datasets in the MuJoCo-Gym environment and successfully selects more valuable samples, verifying its effectiveness.
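The two stages described in the abstract can be made concrete with a minimal sketch. The code below is an illustrative assumption, not the paper's implementation: the clipped exponential advantage weighting, the soft (entropy-regularized) Bellman target, the advantage normalization, and the hyperparameter names gamma, alpha, w_max and bc_coef are all hypothetical choices used only to show the shape of the two update rules.

```python
# Minimal illustrative sketch of the two SHS stages (assumptions: PyTorch,
# an actor(s) returning a torch.distributions.Normal over actions, critics
# returning tensors of shape [batch, 1], rewards/done flags of shape
# [batch, 1]; all weighting forms and hyperparameters are hypothetical).
import torch


def policy_evaluation_loss(critic, target_critic, actor, batch,
                           gamma=0.99, alpha=0.1, w_max=20.0):
    """Advantage-weighted TD loss with an entropy term in the target:
    dataset actions with higher advantage receive larger update weights,
    and the entropy term favors high-probability in-distribution actions."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        next_dist = actor(s2)
        a2 = next_dist.sample()
        log_p2 = next_dist.log_prob(a2).sum(-1, keepdim=True)
        # Soft Bellman target including a policy entropy term (one possible
        # reading of the entropy term mentioned in the abstract).
        target_q = r + gamma * (1.0 - done) * (target_critic(s2, a2) - alpha * log_p2)

        # Advantage of the dataset action over the current policy's action.
        cur_a = actor(s).sample()
        adv = target_critic(s, a) - target_critic(s, cur_a)
        # Larger update weight for higher-advantage samples; the clipped
        # exponential form is an assumed choice.
        w = torch.clamp(torch.exp(adv), max=w_max)

    td_error = critic(s, a) - target_q
    return (w * td_error.pow(2)).mean()


def policy_improvement_loss(critic, actor, batch, bc_coef=2.5):
    """Maximize a normalized advantage while constraining the policy to
    actions that appear in the dataset (behavior-cloning term)."""
    s, a, _, _, _ = batch
    dist = actor(s)
    pi_a = dist.rsample()
    adv = critic(s, pi_a) - critic(s, a).detach()
    # Normalize so the strength of the dataset constraint is scale-invariant
    # (the exact normalization is an assumption).
    adv = adv / (adv.abs().mean().detach() + 1e-6)
    bc_loss = -dist.log_prob(a).sum(-1).mean()
    return -adv.mean() + bc_coef * bc_loss
```

In practice the critic and actor would be updated by minimizing these two losses alternately on minibatches drawn from the offline dataset; the weighting in the first loss is what concentrates the updates on the higher-quality samples.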