
Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs



Abstract

Acting in domains where an agent must plan several steps ahead to achieve a goal can be a challenging task, especially if the agent's sensors provide only noisy or partial information. In this setting, Partially Observable Markov Decision Processes (POMDPs) provide a planning framework that optimally trades between actions that contribute to the agent's knowledge and actions that increase the agent's immediate reward. However, the task of specifying the POMDP's parameters is often onerous. In particular, setting the immediate rewards to achieve a desired balance between information-gathering and acting is often not intuitive.

In this work, we propose an approximation based on minimizing the immediate Bayes risk for choosing actions when transition, observation, and reward models are uncertain. The Bayes-risk criterion avoids the computational intractability of solving a POMDP with a multi-dimensional continuous state space; we show it performs well in a variety of problems. We use policy queries, in which we ask an expert for the correct action, to infer the consequences of a potential pitfall without experiencing its effects. More important for human-robot interaction settings, policy queries allow the agent to learn the reward model without the reward values ever being specified.
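To make the Bayes-risk criterion concrete, the following is a minimal Python sketch, not the paper's implementation. It samples POMDP models from a posterior, scores each action by its expected loss relative to the per-model optimal action, and falls back to a policy query when even the best action is too risky. The interfaces `model_posterior.sample`, `q_value`, and the constant `RISK_THRESHOLD` are hypothetical placeholders standing in for whatever model-learning and planning machinery is actually used.

```python
import numpy as np

RISK_THRESHOLD = -0.1   # hypothetical: query the expert below this risk
N_MODEL_SAMPLES = 20    # number of posterior samples over (T, O, R)

def bayes_risk_action(belief, model_posterior, actions, q_value):
    """Pick the action minimizing immediate Bayes risk over sampled models.

    Bayes risk of action a:
        BR(a) = E_m[ Q_m(b, a) - max_a' Q_m(b, a') ]
    i.e. the expected loss of taking a instead of the model-optimal action,
    averaged over models m drawn from the posterior. BR(a) <= 0, and values
    closer to zero are safer.
    """
    models = model_posterior.sample(N_MODEL_SAMPLES)
    risks = np.zeros(len(actions))
    for m in models:
        q = np.array([q_value(m, belief, a) for a in actions])
        risks += q - q.max()          # loss relative to best action under m
    risks /= len(models)

    best = int(np.argmax(risks))      # least expected loss
    if risks[best] < RISK_THRESHOLD:
        # Even the safest action carries too much expected loss: ask the
        # expert what to do. The answer avoids the pitfall and also
        # constrains the posterior over reward (and other) parameters.
        return ("policy_query", None)
    return ("act", actions[best])

if __name__ == "__main__":
    # Toy usage: two actions; each "model" is just a pair of sampled
    # per-action values, standing in for Q-values under that model.
    class DummyPosterior:
        def sample(self, n):
            rng = np.random.default_rng(0)
            return rng.normal([1.0, 0.8], 0.3, size=(n, 2))

    def q_value(model, belief, a):
        return model[a]

    print(bayes_risk_action(None, DummyPosterior(), [0, 1], q_value))
```

The threshold plays the role the abstract describes: it controls how readily the agent asks for help, trading the cost of querying the expert against the expected cost of acting under model uncertainty.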
