Source: JMLR: Workshop and Conference Proceedings

A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning


Abstract

We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased, but it has a variance-reduced property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem, which allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm and then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity for attaining a first-order stationary point of the composite problem, which improves on that of the existing REINFORCE/GPOMDP and SVRPG methods in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite setting indeed has some advantages over the non-composite one on certain problems.
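The idea of the hybrid estimator can be illustrated with a short sketch: a convex combination of an unbiased REINFORCE term and a SARAH-style recursive term corrected by importance weights. The Python fragment below is a minimal, hypothetical illustration of that combination only; the helper names reinforce_grad and importance_weight, their signatures, and the batch arguments are assumptions for exposition, not the paper's actual implementation.

    def hybrid_policy_gradient(v_prev, theta_prev, theta_curr,
                               trajs, trajs_indep,
                               reinforce_grad, importance_weight, beta=0.5):
        # Unbiased REINFORCE term, computed on an independent batch of
        # trajectories sampled under the current policy theta_curr.
        g_unbiased = reinforce_grad(trajs_indep, theta_curr)

        # SARAH-style recursive term: reuse the previous estimate v_prev and
        # correct it with a gradient difference on a shared batch. Because
        # `trajs` were sampled under theta_curr, the gradient at theta_prev
        # is reweighted by per-trajectory importance weights (hypothetical
        # `weights` keyword of the assumed helper).
        w = importance_weight(trajs, theta_prev, theta_curr)
        g_sarah = (v_prev
                   + reinforce_grad(trajs, theta_curr)
                   - reinforce_grad(trajs, theta_prev, weights=w))

        # Convex combination of the two terms: biased overall, but with
        # reduced variance compared to plain REINFORCE.
        return beta * g_sarah + (1.0 - beta) * g_unbiased

In a proximal variant, the returned estimate would be used in a proximal (projected) policy update to handle the constraints or regularizers mentioned above.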
