Pattern Recognition and Artificial Intelligence
Pattern Recognition and Artificial Intelligence, 2025, Vol. 38, Issue (2): 177-191    DOI: 10.16451/j.cnki.issn1003-6059.202502006
Research and Applications
Hessian Aided Probabilistic Policy Gradient Method
HU Lei1, LI Yongqiang1, FENG Yu1, FENG Yuanjing1
1. College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023

Abstract: Policy gradient methods in reinforcement learning are widely applied to continuous decision-making problems due to their generality. However, their practical performance is consistently constrained by low sample utilization caused by high gradient variance. In this paper, a Hessian-aided probabilistic policy gradient method (HAPPG) is proposed, and a bimodal gradient estimation mechanism is designed on the basis of the probabilistic gradient estimator (PAGE). Historical momentum is added to the large-batch estimation to restrict the fluctuation of gradient descent, and a variance-reduced estimation based on the Hessian-aided technique is constructed by introducing second-order curvature information of the policy parameter space into the small-batch estimation. Theoretical analysis demonstrates that HAPPG achieves an O(ϵ⁻³) sample complexity under non-convex optimization conditions, attaining the best convergence rate among existing methods. Experimental results validate its superior performance across multiple benchmark control tasks. Furthermore, the Hessian-aided probabilistic policy gradient estimator is combined with proximal policy optimization (PPO) by embedding the adaptive learning rate mechanism of the Adam optimizer, resulting in HAP-PPO, which outperforms PPO. The designed gradient estimator can also be applied to further enhance mainstream reinforcement learning algorithms.
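To make the bimodal update described in the abstract concrete, the following Python sketch illustrates a PAGE-style probabilistic estimator on a toy stochastic quadratic objective: with probability p a momentum-smoothed large-batch gradient is recomputed, otherwise the previous estimate is corrected by a small-batch Hessian-vector product. The toy objective, the hyperparameters (p, beta, eta, batch sizes), and the helper names grad_batch, hvp_batch, and happg_like are illustrative assumptions for exposition only, not the authors' implementation of HAPPG.

import numpy as np

rng = np.random.default_rng(0)
dim = 10
A = np.diag(np.linspace(0.5, 5.0, dim))   # fixed curvature of the toy objective
sigma = 1.0                                # noise level of the stochastic gradients

def grad_batch(theta, n):
    """Mini-batch stochastic gradient of f(theta) = 0.5*theta^T A theta + xi^T theta."""
    xi = rng.normal(0.0, sigma, size=(n, dim))
    return A @ theta + xi.mean(axis=0)

def hvp_batch(delta, n):
    """Mini-batch Hessian-vector product; the Hessian of the toy objective is simply A."""
    return A @ delta                       # n kept only for interface symmetry with grad_batch

def happg_like(theta, steps=500, eta=0.05, p=0.1, beta=0.2,
               big_batch=256, small_batch=8):
    """PAGE-style bimodal estimator: momentum on the large batch, Hessian-aided
    variance-reduced correction on the small batch (illustrative only)."""
    g = grad_batch(theta, big_batch)                   # initial large-batch estimate
    for _ in range(steps):
        theta_new = theta - eta * g                    # plain gradient step
        if rng.random() < p:                           # large-batch branch with historical momentum
            g = (1 - beta) * grad_batch(theta_new, big_batch) + beta * g
        else:                                          # small-batch, Hessian-aided branch
            g = g + hvp_batch(theta_new - theta, small_batch)
        theta = theta_new
    return theta

theta_final = happg_like(np.ones(dim))
print("final objective:", 0.5 * theta_final @ A @ theta_final)

In the actual method the Hessian-vector products would be estimated from policy log-probabilities along sampled trajectories rather than from a closed-form Hessian; the sketch only shows how the two estimation branches interact.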
Key words: Machine Learning, Reinforcement Learning, Policy Gradient, Variance Reduction
Received: 2024-11-20
CLC Number: TP18
Fund: Supported by National Natural Science Foundation of China (No. 62073294, U2341216)
Corresponding author: LI Yongqiang, Ph.D., associate professor. His research interests include reinforcement learning and control theory. E-mail: yqli@zjut.edu.cn.
About the authors: HU Lei, master student. His research interests include deep reinforcement learning and intelligent game playing. E-mail: 211122030031@zjut.edu.cn. FENG Yu, Ph.D., professor. His research interests include multi-agent games, deep reinforcement learning, and optimal and robust control. E-mail: yfeng@zjut.edu.cn. FENG Yuanjing, Ph.D., professor. His research interests include medical image processing, machine vision, and brain-computer intelligence. E-mail: fyjing@zjut.edu.cn.
Cite this article:
HU Lei, LI Yongqiang, FENG Yu, FENG Yuanjing. Hessian Aided Probabilistic Policy Gradient Method. Pattern Recognition and Artificial Intelligence, 2025, 38(2): 177-191.
Link to this article:
http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202502006      or     http://manu46.magtech.com.cn/Jweb_prai/CN/Y2025/V38/I2/177