The Impatient May Use Limited Optimism to Minimize Regret

Abstract

Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may realize that, with hindsight, she could have increased her reward by playing differently: this difference in outcomes constitutes her regret value. The agent may thus elect to follow a regret-minimal strategy. In this paper, it is shown that (1) there always exist regret-minimal strategies that are admissible (a strategy is inadmissible if there is another strategy that always performs better); (2) computing the minimum possible regret, or checking that a strategy is regret-minimal, can be done in coNP^NP, disregarding the computational cost of numerical analysis (otherwise, this bound becomes PSpace).
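
To make the notions of discounted sum and regret concrete, here is a minimal Python sketch. It is not the paper's algorithm: the games studied are infinite-duration with infinitely many strategies, whereas this toy assumes finite reward prefixes and a hand-written table of two agent strategies against two environment behaviours. All names (`discounted_sum`, `regret`, `greedy`, `patient`, `friendly`, `hostile`) and all reward values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def discounted_sum(rewards, discount):
    """Sum of discount**i * rewards[i] over a finite reward prefix."""
    return sum(r * discount**i for i, r in enumerate(rewards))


def regret(strategy, payoff):
    """Worst case, over environment behaviours, of the gap between the
    best payoff achievable in hindsight and the payoff of `strategy`."""
    return max(
        max(payoff[alt][env] for alt in payoff) - payoff[strategy][env]
        for env in payoff[strategy]
    )


# Assumed discount factor; exact rationals avoid floating-point noise.
half = Fraction(1, 2)

# Hypothetical toy table: reward prefixes for each (strategy, environment).
rewards = {
    "greedy": {"friendly": [1, 0, 0], "hostile": [1, 0, 0]},
    "patient": {"friendly": [0, 6, 0], "hostile": [0, 0, 0]},
}

# Discounted value of every (strategy, environment) pair.
payoff = {
    s: {e: discounted_sum(seq, half) for e, seq in row.items()}
    for s, row in rewards.items()
}

# Regret of each strategy, and a strategy of minimal regret in this toy.
regrets = {s: regret(s, payoff) for s in payoff}
best = min(regrets, key=regrets.get)
```

In this toy table the late reward of 6 is worth only 3 after discounting, so `greedy` can regret up to 2 (against `friendly`) while `patient` can regret up to 1 (against `hostile`). The regret-minimal choice here is `patient`, but different hypothetical numbers or a smaller discount factor can flip that outcome.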