Theoretical Computer Science

Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments



Abstract

The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions ("experts") under partial observation: in each round t, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force the player to incur regret growing as t^{1/2} in the time t. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves regret upper bounded by O(t^{1/2+epsilon}), i.e. arbitrarily close to the optimal order. To this end, we build on the "follow the perturbed leader" principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high-probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner's decisions impact the future behaviour of the adversary, so that a strong strategy can draw benefit from well-chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most t^{1/2+epsilon} relative to such strong strategies, and even sublinear regret not exceeding O(t^{4/5}) w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy. (C) 2008 Elsevier B.V. All rights reserved.
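The abstract describes a "follow the perturbed leader" (FPL) learner with bandit feedback over a countable expert class. Below is a minimal, hypothetical Python sketch in that spirit, not the paper's exact algorithm: it truncates the countable decision set to a finite prefix with a prior w_i proportional to 2^{-i}, uses exponential perturbations, and builds importance-weighted cost estimates on exploration rounds. The function name `bandit_fpl` and the rates gamma_t and eta_t are illustrative assumptions.

```python
import random

def bandit_fpl(cost_fn, num_experts=16, horizon=10000, seed=0):
    """Sketch of follow-the-perturbed-leader with bandit feedback.

    cost_fn(t, i) returns the cost (in [0, 1] here, for simplicity)
    of decision i in round t; only the played decision's cost is
    revealed to the learner.  The countable decision set is
    truncated to num_experts entries with prior w_i ~ 2^{-i}.
    The rates gamma and eta are illustrative placeholders.
    """
    rng = random.Random(seed)
    weights = [2.0 ** -(i + 1) for i in range(num_experts)]
    z = sum(weights)
    prior = [w / z for w in weights]
    est = [0.0] * num_experts  # importance-weighted cumulative cost estimates
    total_cost = 0.0
    for t in range(1, horizon + 1):
        gamma = min(1.0, t ** -0.25)  # exploration probability
        eta = t ** -0.5               # inverse perturbation scale
        if rng.random() < gamma:
            # Exploration round: sample from the prior, then form an
            # unbiased estimate c / P(play i) for the sampled decision.
            i = rng.choices(range(num_experts), weights=prior)[0]
            c = cost_fn(t, i)
            est[i] += c / (gamma * prior[i])
        else:
            # Exploitation round: follow the perturbed leader, drawing
            # a fresh exponential perturbation for each decision.
            i = min(range(num_experts),
                    key=lambda j: est[j] - rng.expovariate(1.0) / eta)
            c = cost_fn(t, i)  # estimates are not updated on exploit rounds
        total_cost += c
    return total_cost

# Toy usage: decision 3 is cheaper in expectation than the rest.
if __name__ == "__main__":
    rnd = random.Random(1)
    def cost_fn(t, i):
        return rnd.random() * (0.5 if i == 3 else 1.0)
    print(bandit_fpl(cost_fn))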


