Bandits with Knapsacks

Abstract

Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications, ranging from medical trials to communication networks to Web search and advertising. In many of these application domains, the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called "bandits with knapsacks", that combines aspects of stochastic integer programming with online learning. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret in the bandits-with-knapsacks problem is significantly more challenging than in conventional bandit problems. We present two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel "balanced exploration" paradigm, while the other is a primal-dual algorithm that uses multiplicative updates. Further, we prove that the regret achieved by both algorithms is optimal up to polylogarithmic factors. We illustrate the generality of the problem by presenting applications in a number of different domains, including electronic commerce, routing, and scheduling. As one example of a concrete application, we consider the problem of dynamic posted pricing with limited supply and obtain the first algorithm whose regret, with respect to the optimal dynamic policy, is sublinear in the supply.
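The claim that the best dynamic policy can beat the best fixed arm can be illustrated with a toy instance. The sketch below is not taken from the abstract: it assumes two arms with deterministic reward 1, each drawing one unit per pull from its own resource, equal budgets `B`, a horizon of `2 * B` rounds, and a stopping rule under which play ends as soon as a pull would exceed a remaining budget. All names (`run_policy`, `ARMS`, `B`) are hypothetical.

```python
def run_policy(choose_arm, budgets, horizon, arms):
    """Play until the horizon ends or a pull would exceed a remaining budget.

    choose_arm: function mapping the round index to an arm id.
    arms: dict mapping arm id -> (reward, per-resource cost tuple).
    The stopping rule and deterministic rewards are simplifying assumptions.
    """
    remaining = list(budgets)
    total_reward = 0.0
    for t in range(horizon):
        reward, cost = arms[choose_arm(t)]
        # Stop as soon as the chosen pull cannot be paid for.
        if any(c > r for c, r in zip(cost, remaining)):
            break
        remaining = [r - c for r, c in zip(remaining, cost)]
        total_reward += reward
    return total_reward


B = 50  # budget per resource (assumed value)
# Arm 0 consumes only resource 0; arm 1 consumes only resource 1.
ARMS = {0: (1.0, (1, 0)), 1: (1.0, (0, 1))}

# Best fixed arm: always pull arm 0 until its resource runs out.
fixed = run_policy(lambda t: 0, (B, B), 2 * B, ARMS)
# Dynamic policy: alternate arms so both resources are used.
alternating = run_policy(lambda t: t % 2, (B, B), 2 * B, ARMS)

print(fixed, alternating)  # prints "50.0 100.0"
```

In this instance either fixed arm collects reward `B` before its resource is exhausted, while alternating between the arms collects `2 * B`, so a regret benchmark defined against the best fixed arm would understate what is achievable.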
