International Conference on Machine Learning

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits



Abstract

We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive an O(K Δ^(-1) log n) bound on its n-round regret, where Δ is the difference in the expected rewards of the optimal and the best sub-optimal arms, and K is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that they perform well.


