...
JMLR: Workshop and Conference Proceedings

A Nonparametric Off-Policy Policy Gradient

Abstract

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms that perform updates using on-policy samples. The price of such inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
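The abstract does not spell out the estimator, but the construction it describes can be illustrated with a minimal sketch: kernel regression over off-policy transitions turns the Bellman equation into a finite linear system over the sampled states, whose solution is a closed-form value estimate. All names below (`rbf`, `nonparametric_value_estimate`, the bandwidths `h_s`, `h_a`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf(x, centers, bandwidth):
    """Gaussian kernel weights between a batch of points and a set of centers."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def nonparametric_value_estimate(S, A, R, S_next, policy_actions,
                                 gamma=0.99, h_s=0.5, h_a=0.5):
    """Closed-form value estimate at the sampled states (hypothetical sketch).

    S, A, R, S_next : off-policy transitions (n samples each).
    policy_actions  : actions the current policy would take at the sampled states.
    Kernel smoothing turns the Bellman equation into the finite linear system
        V = r_pi + gamma * P_pi V,
    which is solved exactly for V.
    """
    n = S.shape[0]
    # State-action kernel weights: how strongly each logged sample supports (s_i, pi(s_i)).
    K = rbf(S, S, h_s) * rbf(policy_actions, A, h_a)
    K = K / K.sum(axis=1, keepdims=True)            # row-normalized kernel regression
    r_pi = K @ R                                     # smoothed reward under the policy
    # Successor states mapped back onto the sample support.
    P_succ = rbf(S_next, S, h_s)
    P_succ = P_succ / P_succ.sum(axis=1, keepdims=True)
    P_pi = K @ P_succ                                # policy-weighted transition matrix
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return V
```

Because the value estimate is an explicit function of `policy_actions`, differentiating it with respect to the policy parameters (for example via automatic differentiation) yields an analytic, fully off-policy gradient in the spirit the abstract describes; the sketch above only shows the value-estimation step.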
