Conference: Algorithmic Learning Theory

Deviations of Stochastic Bandit Regret

Abstract

This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that, with probability at least 1 - 1/n, the regret of the policy is of order log n. They have also shown that this property is not shared by the popular UCB1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy, that is, any policy that does not use knowledge of n. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms.
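To make the terms "regret" and "anytime policy" concrete, here is a minimal sketch of the standard UCB1 index policy of Auer et al. (2002) on Bernoulli arms. It is not one of the robust policies the paper designs, which the abstract does not describe. The function name `ucb1_regret`, the arm means, the horizon, and the seeds are illustrative assumptions. The index depends only on the current round t and never on the total number of plays n, which is what makes the policy anytime.

```python
import math
import random


def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return the pseudo-regret after `horizon` plays.

    `means` are hypothetical arm means chosen for illustration only.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of times each arm has been played
    sums = [0.0] * k    # total reward collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: play each arm once
        else:
            # UCB1 index: empirical mean plus sqrt(2 ln t / T_i).
            # Only the current round t is used, never the horizon.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected loss of this play
    return regret


if __name__ == "__main__":
    # Look at the spread of the regret over independent runs, not just its mean.
    runs = sorted(ucb1_regret([0.5, 0.6], horizon=1000, seed=s) for s in range(100))
    print("median regret:", runs[len(runs) // 2], "largest regret:", runs[-1])
```

Comparing the median with the largest value over many seeds gives a rough empirical view of the deviations the abstract refers to. A simulation of this size illustrates the quantities involved and does not by itself demonstrate the paper's negative result.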