International Conference on Machine Learning

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits



Abstract

We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive an O(K Δ^(-1) log n) bound on its n-round regret, where Δ is the difference in the expected rewards of the optimal and the best sub-optimal arms, and K is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that they perform well.


