
Gambler's Ruin Bandit Problem



Abstract

In this paper, we propose a new multi-armed bandit problem called the Gambler's Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a sequence of rounds, where each round is a Markov Decision Process (MDP) with two actions (arms): a continuation action, which moves the learner randomly over the state space around its current state, and a terminal action, which moves the learner directly into one of two terminal states (the goal state or the dead-end state). The current round ends when a terminal state is reached, and the learner incurs a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (the expected number of times the goal state is reached), without any prior knowledge of the state transition probabilities. We first prove a result characterizing the form of the optimal policy for the GRBP. Then, we define the regret of the learner with respect to an omnipotent oracle, which acts optimally in each round, and prove that it grows logarithmically in the number of rounds. We also identify a condition under which the learner's regret is bounded. A potential application of the GRBP is optimal medical treatment assignment, in which the continuation action corresponds to a conservative treatment and the terminal action corresponds to a risky treatment such as surgery.
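The round structure described in the abstract can be illustrated with a small Monte Carlo sketch. Everything concrete here is an assumption for illustration, not taken from the paper: states are the integers 0..N with 0 as the dead-end and N as the goal, the continuation action is a gambler's-ruin step (up one state with probability `p_up`, else down one), and the terminal action succeeds with a hypothetical state-dependent probability `q[s]`.

```python
import random

def play_round(start, N, p_up, q, policy, rng):
    """Run one GRBP round; return 1 if the goal state N is reached, else 0.

    Hypothetical dynamics: CONTINUE does a +/-1 random walk step;
    TERMINATE jumps straight to the goal with probability q[s],
    otherwise to the dead-end state 0.
    """
    s = start
    while 0 < s < N:
        if policy(s) == "terminate":
            return 1 if rng.random() < q[s] else 0
        s += 1 if rng.random() < p_up else -1  # continuation step
    return 1 if s == N else 0

def success_rate(policy, n_rounds=20000, N=10, start=5, p_up=0.55, seed=0):
    """Empirical goal-reaching frequency of a fixed policy over many rounds."""
    rng = random.Random(seed)
    q = {s: s / N for s in range(1, N)}  # assumed terminal-success curve
    wins = sum(play_round(start, N, p_up, q, policy, rng)
               for _ in range(n_rounds))
    return wins / n_rounds

always_continue = lambda s: "continue"
always_terminate = lambda s: "terminate"
```

Under these particular (favorable-drift) assumptions, always continuing from the middle state outperforms terminating immediately; the paper's point is that the learner must discover such comparisons online, without knowing the transition probabilities.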
