Abstract Policy gradient methods in reinforcement learning are widely applied to continuous decision-making problems due to their generality. However, their practical performance is consistently constrained by low sample efficiency caused by high gradient variance. In this paper, a Hessian-aided probabilistic policy gradient method (HAPPG) is proposed, and a bimodal gradient estimation mechanism is designed based on the probabilistic gradient estimator. Historical momentum is added to the large-batch estimation to restrain optimization fluctuations of gradient descent, and a variance-reduced estimation based on the Hessian-aided technique is constructed by introducing second-order curvature information of the policy parameters into the small-batch estimation. Theoretical analysis demonstrates that HAPPG achieves an O(ε⁻³) sample complexity under non-convex optimization conditions, matching the best-known convergence rate among existing methods. Experimental results validate its superior performance across multiple benchmark control tasks. Furthermore, the Hessian-aided probabilistic policy gradient estimator is combined with proximal policy optimization (PPO) by embedding the adaptive learning-rate mechanism of the Adam optimizer, resulting in HAP-PPO. HAP-PPO outperforms PPO, and the designed gradient estimator can be applied to further enhance mainstream reinforcement learning algorithms.
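To make the bimodal estimator structure concrete, the following is a minimal sketch, assuming a synthetic stochastic quadratic objective in place of a policy-gradient task; the function names (stochastic_grad, stochastic_hvp), batch sizes, and coefficients are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of a probabilistic (PAGE-style) gradient estimator with a
# Hessian-aided small-batch correction, shown on a toy stochastic
# quadratic objective f(theta) = 0.5 * theta^T A theta.
rng = np.random.default_rng(0)
dim = 5
A = np.diag(rng.uniform(0.5, 2.0, size=dim))      # true curvature of the toy objective

def stochastic_grad(theta, batch_size):
    """Noisy gradient of f, averaged over a batch of simulated samples."""
    noise = rng.normal(0.0, 0.1, size=(batch_size, dim))
    return A @ theta + noise.mean(axis=0)

def stochastic_hvp(theta_prev, theta_curr, batch_size):
    """Hessian-aided correction: a noisy Hessian-vector product with the
    parameter displacement, evaluated at a random point on the segment
    between consecutive iterates (constant Hessian here, so the point is unused)."""
    alpha = rng.uniform()
    _ = (1.0 - alpha) * theta_prev + alpha * theta_curr
    noise = rng.normal(0.0, 0.1, size=(batch_size, dim, dim))
    return (A + noise.mean(axis=0)) @ (theta_curr - theta_prev)

# Illustrative hyper-parameters: switch probability, step size, momentum.
p, lr, beta = 0.2, 0.05, 0.9
big_batch, small_batch = 256, 8

theta = rng.normal(size=dim)
v = stochastic_grad(theta, big_batch)             # large-batch initialization

for t in range(200):
    theta_prev, theta = theta, theta - lr * v     # gradient step with current estimate
    if rng.uniform() < p:
        # Mode 1: large-batch estimate blended with historical momentum.
        v = beta * v + (1.0 - beta) * stochastic_grad(theta, big_batch)
    else:
        # Mode 2: small-batch recursive update with Hessian-aided correction.
        v = v + stochastic_hvp(theta_prev, theta, small_batch)

print("final |theta|:", np.linalg.norm(theta))    # should shrink toward the minimizer at 0
```

In this sketch the small-batch branch exploits the identity ∇f(θ_t) − ∇f(θ_{t−1}) = ∫₀¹ ∇²f(θ_α)(θ_t − θ_{t−1}) dα, estimating the integral with a single stochastic Hessian-vector product, which is the role the abstract assigns to the second-order curvature information.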
Fund: National Natural Science Foundation of China (No. 62073294, U2341216)
Corresponding Authors:
LI Yongqiang, Ph.D., associate professor. His research interests include reinforcement learning and control theory.
About authors: HU Lei, master's student. His research interests include reinforcement learning and intelligent games. FENG Yu, Ph.D., professor. His research interests include multi-agent games, deep reinforcement learning, and optimal and robust control. FENG Yuanjing, Ph.D., professor. His research interests include medical image processing, machine vision and brain intelligence.