Bandits with Knapsacks


Abstract

Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications, ranging from medical trials to communication networks to Web search and advertising. In many of these application domains the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called "bandits with knapsacks", that combines aspects of stochastic integer programming with online learning. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems. We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel "balanced exploration" paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
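
The claim that the optimal policy can beat the best fixed arm can be illustrated with a small deterministic sketch. The instance below is hypothetical and is not taken from the record above: two arms, two resources, each arm earning one unit of reward and consuming one unit of its own resource. The names `T`, `B`, `rewards`, `costs`, `fixed_arm_value`, and `mixture_value` are all assumptions made for illustration, and the mixture is evaluated in expectation (a fluid approximation) rather than by simulating random pulls.

```python
# Hypothetical bandits-with-knapsacks instance (illustrative values only).
T = 100                      # time horizon (number of rounds)
B = 50                       # budget available for each resource
rewards = [1.0, 1.0]         # expected reward per pull of each arm
costs = [[1.0, 0.0],         # arm 0 consumes one unit of resource 0 per pull
         [0.0, 1.0]]         # arm 1 consumes one unit of resource 1 per pull


def fixed_arm_value(arm):
    """Total reward from pulling a single arm until time or a budget runs out."""
    pulls = T
    for c in costs[arm]:
        if c > 0:
            pulls = min(pulls, B / c)
    return rewards[arm] * pulls


def mixture_value(p):
    """Expected total reward when arm 0 is chosen with probability p each round.

    Uses expected per-round consumption (a fluid approximation), stopping
    when the horizon ends or the first resource is exhausted.
    """
    probs = [p, 1.0 - p]
    reward_rate = sum(probs[a] * rewards[a] for a in range(2))
    pulls = T
    for j in range(2):
        rate = sum(probs[a] * costs[a][j] for a in range(2))
        if rate > 0:
            pulls = min(pulls, B / rate)
    return reward_rate * pulls


best_fixed = max(fixed_arm_value(a) for a in range(2))
best_mix = max(mixture_value(k / 100) for k in range(101))
print(best_fixed, best_mix)
```

In this instance any single arm exhausts its resource after 50 pulls, while an even mixture of the two arms spreads consumption across both budgets and lasts the full horizon, so it collects roughly twice the reward. This is only meant to show why a fixed-arm benchmark can be too weak under budget constraints; it does not implement either of the algorithms the abstract mentions.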


Bibliographic details

Source
IEEE Annual Symposium on Foundations of Computer Science, 2013, pp. 207-216 (10 pages)
Authors
Badanidiyuru Ashwinkumar; Kleinberg Robert; Slivkins Aleksandrs
Original format
PDF
Keywords
Multi-armed bandits; dynamic ad allocation; dynamic pricing; dynamic procurement; exploration-exploitation tradeoff; regret; stochastic packing


Similar literature


1. Online Learning with Vector Costs and Bandits with Knapsacks [J]. Thomas Kesselheim, Sahil Singla. JMLR: Workshop and Conference Proceedings. 2020, no. 2010
2. Bandits with Knapsacks [J]. Badanidiyuru Ashwinkumar, Kleinberg Robert, Slivkins Aleksandrs. Journal of the Association for Computing Machinery. 2018, no. 3
3. Combinatorial Semi-Bandits with Knapsacks [J]. Karthik Abinav Sankararaman, Aleksandrs Slivkins. JMLR: Workshop and Conference Proceedings. 2018, no. 2010
4. Unifying the Stochastic and the Adversarial Bandits with Knapsack [C]. Anshuka Rangi, Massimo Franceschetti, Long Tran-Thanh. International Joint Conference on Artificial Intelligence. 2020
5. Adaptive Preference Learning with Bandit Feedback: Information Filtering, Dueling Bandits and Incentivizing Exploration [D]. Chen, Bangrui. 2017
6. Smoking and the bandit: A preliminary study of smoker and non-smoker differences in exploratory behavior measured with a multi-armed bandit task [O]. Merideth A. Addicott, John M. Pearson, Jessica Wilson, -1
7. Adversarial Bandits with Knapsacks [O]. Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire. 2019
