European Workshop on Reinforcement Learning

Markov Decision Processes with Arbitrary Reward Processes


Abstract

We consider a control problem where the decision maker interacts with a standard Markov decision process with the exception that the reward functions vary arbitrarily over time. We extend the notion of Hannan consistency to this setting, showing that, in hindsight, the agent can perform almost as well as every deterministic policy. We present efficient online algorithms in the spirit of reinforcement learning that ensure that the agent's performance loss, or regret, vanishes over time, provided that the environment is oblivious to the agent's actions. However, counterexamples indicate that the regret does not vanish if the environment is not oblivious.
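The hindsight guarantee in the abstract can be illustrated with a toy experts-style sketch. This is a hypothetical simplification, not the paper's algorithm: it treats every deterministic stationary policy of a small two-state MDP as an expert and runs an exponentially weighted forecaster over their simulated per-round rewards, which sidesteps the state-coupling (the carry-over of state between rounds) that the actual MDP setting must handle. The MDP dynamics, the reward sequence, and all names below are made up for illustration; the reward sequence is oblivious, i.e. fixed in advance and independent of the agent's actions.

```python
import itertools
import math
import random

random.seed(0)

# Toy MDP (hypothetical): 2 states, 2 actions, deterministic dynamics.
N_STATES, N_ACTIONS, T = 2, 2, 500

def transition(s, a):
    # Action 0 stays in the current state; action 1 switches state.
    return s if a == 0 else 1 - s

# Oblivious, arbitrary reward sequence: rewards[t][s][a] is fixed in
# advance and does not depend on the agent's behavior.
rewards = [[[random.random() for _ in range(N_ACTIONS)]
            for _ in range(N_STATES)] for _ in range(T)]

# Experts = all deterministic stationary policies (a map state -> action).
policies = list(itertools.product(range(N_ACTIONS), repeat=N_STATES))

def run_policy(pi):
    """Simulate one policy from state 0; return its per-round rewards."""
    s, per_round = 0, []
    for t in range(T):
        a = pi[s]
        per_round.append(rewards[t][s][a])
        s = transition(s, a)
    return per_round

per_round = {pi: run_policy(pi) for pi in policies}
best = max(sum(r) for r in per_round.values())  # best policy in hindsight

# Exponentially weighted forecaster over policies: the agent's expected
# reward each round is the weight-averaged reward of the policies, and
# weights grow multiplicatively with each policy's observed reward.
eta = math.sqrt(8 * math.log(len(policies)) / T)
weights = {pi: 1.0 for pi in policies}
agent_reward = 0.0
for t in range(T):
    z = sum(weights.values())
    agent_reward += sum(w / z * per_round[pi][t]
                        for pi, w in weights.items())
    for pi in weights:
        weights[pi] *= math.exp(eta * per_round[pi][t])

regret = best - agent_reward
print(f"average regret per round: {regret / T:.4f}")
```

With rewards in [0, 1], the standard exponentially weighted average bound gives cumulative regret at most sqrt((T/2) ln N) against the best of the N policies, so the average regret per round vanishes as T grows, mirroring the Hannan-consistency guarantee the abstract states for the oblivious case.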
