...
JMLR: Workshop and Conference Proceedings

A Nonparametric Off-Policy Policy Gradient

Abstract

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms that perform updates using on-policy samples. The price of such inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
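The abstract does not spell out the estimator, but the construction it describes can be illustrated with a minimal sketch: kernel regression over off-policy transitions turns the Bellman equation into a finite linear system over the sampled states, whose solution is a closed-form value estimate. All names below (`rbf`, `nonparametric_value_estimate`, the bandwidths `h_s`, `h_a`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf(x, centers, bandwidth):
    """Gaussian kernel weights between a batch of points and a set of centers."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def nonparametric_value_estimate(S, A, R, S_next, policy_actions,
                                 gamma=0.99, h_s=0.5, h_a=0.5):
    """Closed-form value estimate at the sampled states (hypothetical sketch).

    S, A, R, S_next : off-policy transitions (n samples each).
    policy_actions  : actions the current policy would take at the sampled states.
    Kernel smoothing turns the Bellman equation into the finite linear system
        V = r_pi + gamma * P_pi V,
    which is solved exactly for V.
    """
    n = S.shape[0]
    # State-action kernel weights: how strongly each logged sample supports (s_i, pi(s_i)).
    K = rbf(S, S, h_s) * rbf(policy_actions, A, h_a)
    K = K / K.sum(axis=1, keepdims=True)            # row-normalized kernel regression
    r_pi = K @ R                                     # smoothed reward under the policy
    # Successor states mapped back onto the sample support.
    P_succ = rbf(S_next, S, h_s)
    P_succ = P_succ / P_succ.sum(axis=1, keepdims=True)
    P_pi = K @ P_succ                                # policy-weighted transition matrix
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return V
```

Because the value estimate is an explicit function of `policy_actions`, differentiating it with respect to the policy parameters (for example via automatic differentiation) yields an analytic, fully off-policy gradient in the spirit the abstract describes; the sketch above only shows the value-estimation step.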
