Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

Abstract

In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no practical methods exist for determining high-confidence policy performance bounds in the inverse reinforcement learning setting, where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the α-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function. We evaluate our proposed bound on both a standard grid navigation task and a simulated driving task and achieve tighter and more accurate bounds than a feature count-based baseline. We also give examples of how our proposed bound can be utilized to perform risk-aware policy selection and risk-aware policy improvement. Because our proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
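For intuition, the following minimal Python sketch (not the authors' implementation; the function name and the toy data are hypothetical) illustrates the idea behind an α-worst-case estimate of the performance gap: for each reward function R_i sampled from a Bayesian IRL posterior, take the difference in expected return between the optimal policy under R_i and the evaluation policy under R_i, then report the empirical α-quantile of those differences. The paper's actual bound additionally adjusts the quantile index to obtain a high-confidence upper bound; that refinement is omitted here.

import numpy as np

def alpha_worst_case_gap(gaps, alpha=0.95):
    """Empirical alpha-quantile of per-sample performance gaps.

    gaps[i] = (expected return of the optimal policy under reward sample R_i)
              - (expected return of the evaluation policy under R_i),
    where R_1, ..., R_m are drawn from a Bayesian IRL posterior.
    With probability roughly alpha over the posterior, the evaluation
    policy loses no more than the returned value relative to the
    optimal policy under the expert's unknown reward function.
    """
    gaps = np.sort(np.asarray(gaps, dtype=float))
    m = len(gaps)
    # Index of the ceil(alpha * m)-th order statistic (0-based).
    k = min(m - 1, int(np.ceil(alpha * m)) - 1)
    return gaps[k]

# Toy usage with synthetic gap samples (purely illustrative):
toy_gaps = np.random.gamma(shape=2.0, scale=0.5, size=1000)
print(alpha_worst_case_gap(toy_gaps, alpha=0.95))

Such an estimate can then drive risk-aware policy selection, for example by preferring the candidate policy with the smallest α-worst-case gap.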