Theoretical Computer Science

Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments



Abstract

The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions ("experts") under partial observation: in each round t, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force the player to incur regret growing as t^{1/2} in the time t. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves regret upper bounded by O(t^{1/2+epsilon}), i.e. arbitrarily close to the optimal order. To this end, we build on the "follow the perturbed leader" principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high-probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner's decisions impact the future behaviour of the adversary, so that a strong strategy can draw benefit from well-chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most t^{1/2+epsilon} relative to such strong strategies, and even sublinear regret not exceeding O(t^{4/5}) w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy. (C) 2008 Elsevier B.V. All rights reserved.
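The abstract describes a "follow the perturbed leader" (FPL) learner with bandit feedback over a countable expert class. Below is a minimal, hypothetical Python sketch in that spirit, not the paper's exact algorithm: it truncates the countable decision set to a finite prefix with a prior w_i proportional to 2^{-i}, uses exponential perturbations, and builds importance-weighted cost estimates on exploration rounds. The function name `bandit_fpl` and the rates gamma_t and eta_t are illustrative assumptions.

```python
import random

def bandit_fpl(cost_fn, num_experts=16, horizon=10000, seed=0):
    """Sketch of follow-the-perturbed-leader with bandit feedback.

    cost_fn(t, i) returns the cost (in [0, 1] here, for simplicity)
    of decision i in round t; only the played decision's cost is
    revealed to the learner.  The countable decision set is
    truncated to num_experts entries with prior w_i ~ 2^{-i}.
    The rates gamma and eta are illustrative placeholders.
    """
    rng = random.Random(seed)
    weights = [2.0 ** -(i + 1) for i in range(num_experts)]
    z = sum(weights)
    prior = [w / z for w in weights]
    est = [0.0] * num_experts  # importance-weighted cumulative cost estimates
    total_cost = 0.0
    for t in range(1, horizon + 1):
        gamma = min(1.0, t ** -0.25)  # exploration probability
        eta = t ** -0.5               # inverse perturbation scale
        if rng.random() < gamma:
            # Exploration round: sample from the prior, then form an
            # unbiased estimate c / P(play i) for the sampled decision.
            i = rng.choices(range(num_experts), weights=prior)[0]
            c = cost_fn(t, i)
            est[i] += c / (gamma * prior[i])
        else:
            # Exploitation round: follow the perturbed leader, drawing
            # a fresh exponential perturbation for each decision.
            i = min(range(num_experts),
                    key=lambda j: est[j] - rng.expovariate(1.0) / eta)
            c = cost_fn(t, i)  # estimates are not updated on exploit rounds
        total_cost += c
    return total_cost

# Toy usage: decision 3 is cheaper in expectation than the rest.
if __name__ == "__main__":
    rnd = random.Random(1)
    def cost_fn(t, i):
        return rnd.random() * (0.5 if i == 3 else 1.0)
    print(bandit_fpl(cost_fn))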


