
Design for an Optimal Probe



Abstract

Given a Markov decision process (MDP) with expressed prior uncertainties in the process transition probabilities, we consider the problem of computing a policy that optimizes expected total (finite-horizon) reward. Implicitly, such a policy would effectively resolve the "exploration-versus-exploitation tradeoff" faced, for example, by an agent that seeks to optimize total reinforcement obtained over the entire duration of its interaction with an uncertain world. A Bayesian formulation leads to an associated MDP defined over a set of generalized process "hyperstates" whose cardinality grows exponentially with the planning horizon. Here we retain the full Bayesian framework, but sidestep intractability by applying techniques from reinforcement learning theory. We apply our resulting actor-critic algorithm to a problem of "optimal probing," in which the task is to identify unknown transition probabilities of an MDP using online experience.
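The abstract gives no implementation details, so the following is only a minimal sketch of the setup it describes, assuming a small tabular MDP: Dirichlet counts over next states form the posterior part of the "hyperstate," the conjugate count increment plays the role of the Bayesian update, and a one-step linear actor-critic is trained with an information-gain reward (KL divergence between successive posterior means) as a stand-in for the paper's probing objective. The feature map, reward, and hyperparameters are all illustrative assumptions, not the authors' design.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, horizon = 3, 2, 15

# Hidden "true" MDP whose transition probabilities the probe must identify.
true_P = rng.dirichlet(np.ones(nS), size=(nS, nA))

def features(s, alpha):
    """Crude hyperstate features (an assumption): one-hot physical state
    concatenated with the flattened posterior-mean transition matrix."""
    phi = np.zeros(nS + nS * nA * nS)
    phi[s] = 1.0
    means = alpha / alpha.sum(axis=-1, keepdims=True)
    phi[nS:] = means.ravel()
    return phi

def info_gain(alpha_sa, s_next):
    """Stand-in probing reward: KL(new posterior mean || old posterior mean)
    for the visited (s, a) row -- larger when the observation is informative."""
    old = alpha_sa / alpha_sa.sum()
    new_counts = alpha_sa.copy()
    new_counts[s_next] += 1.0
    new = new_counts / new_counts.sum()
    return float(np.sum(new * np.log(new / old)))

dim = nS + nS * nA * nS
w = np.zeros(dim)              # linear critic weights
theta = np.zeros((nA, dim))    # softmax actor weights
lr_w, lr_th = 0.1, 0.05

for episode in range(500):
    alpha = np.ones((nS, nA, nS))   # fresh Dirichlet prior: counts + state = hyperstate
    s = 0
    for t in range(horizon):
        phi = features(s, alpha)
        logits = theta @ phi
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(nA, p=probs)

        s_next = rng.choice(nS, p=true_P[s, a])
        r = info_gain(alpha[s, a], s_next)     # reward = information gained
        alpha[s, a, s_next] += 1.0             # conjugate hyperstate update

        done = t == horizon - 1
        v_next = 0.0 if done else w @ features(s_next, alpha)
        delta = r + v_next - w @ phi           # undiscounted TD(0) error
        w += lr_w * delta * phi                # critic step
        grad = -probs[:, None] * phi[None, :]  # grad of log-softmax policy
        grad[a] += phi
        theta += lr_th * delta * grad          # actor step
        s = s_next
```

Because the reward is information gain, a successful policy learns to steer toward transitions whose outcome distributions are still uncertain, which is the probing behavior the abstract describes; the exponential blow-up in hyperstates is avoided by the (here, linear) function approximation over posterior statistics rather than enumeration.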


