Conference on Neural Information Processing Systems

Improved Regret Bounds for Bandit Combinatorial Optimization



Abstract

Bandit combinatorial optimization is a bandit framework in which, in each round, a player chooses an action from a given finite set A ⊆ {0, 1}^d and incurs a loss equal to the inner product of the chosen action and an unobservable loss vector in R^d. In this paper, we aim to reveal the property that makes bandit combinatorial optimization hard. Recently, Cohen et al. [8] obtained a regret lower bound of Ω((dk^3 T / log T)^(1/2)), where k is the maximum l_1-norm of the action vectors and T is the number of rounds. This lower bound was achieved by considering a continuous, strongly correlated distribution of losses. Our main contribution is to improve this bound by a factor of (log T)^(1/2), to Ω((dk^3 T)^(1/2)), by means of strongly correlated losses with binary values. This bound yields better regret lower bounds for three specific instances of bandit combinatorial optimization: the multitask bandit, the bandit ranking, and the multiple-play bandit. In particular, the bound obtained for the bandit ranking in the present study settles an open problem raised in [8]. In addition, we demonstrate that the problem becomes easier without correlation among the entries of the loss vectors. In fact, if each entry of the loss vector is an independent random variable, then one can achieve a regret of O((dk^2 T)^(1/2)), which is a factor of k^(1/2) smaller than the lower bound shown above. These results indicate that correlation among losses is the reason for the large regret.
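To make the protocol concrete, the interaction described in the abstract can be sketched in a short simulation: each round the player picks a binary action vector of l_1-norm at most k, observes only the scalar loss ⟨a, ℓ_t⟩ (the loss vector itself stays hidden), and regret is measured against the best fixed action in hindsight. This is an illustrative sketch only, with a uniform-random player and i.i.d. binary losses; it is not the correlated-loss construction from the paper, and the function name `simulate` is hypothetical.

```python
import itertools
import random

def simulate(d=4, k=2, T=200, seed=0):
    """Illustrative sketch of one bandit combinatorial optimization run
    (hypothetical helper, not the paper's construction).

    Each round the player picks an action a in A (binary vectors of
    weight exactly k) and observes only the scalar loss <a, loss_vec>.
    Returns the regret of a uniform-random player against the best
    fixed action in hindsight.
    """
    rng = random.Random(seed)
    # Action set A: all binary vectors of l_1-norm exactly k.
    actions = [a for a in itertools.product((0, 1), repeat=d) if sum(a) == k]
    total_loss = 0.0                         # player's cumulative loss
    cum_loss = [0.0] * len(actions)          # cumulative loss of each fixed action
    for _ in range(T):
        # i.i.d. binary losses for simplicity (the paper's hard instance
        # instead uses strongly correlated binary losses).
        loss_vec = [rng.randint(0, 1) for _ in range(d)]
        # A naive uniform-random player; only the scalar loss is revealed.
        a = actions[rng.randrange(len(actions))]
        total_loss += sum(ai * li for ai, li in zip(a, loss_vec))
        # Full-information bookkeeping, used only to compute regret afterwards.
        for i, b in enumerate(actions):
            cum_loss[i] += sum(bi * li for bi, li in zip(b, loss_vec))
    best_fixed = min(cum_loss)               # best fixed action in hindsight
    return total_loss - best_fixed
```

Since every per-round loss lies in [0, k], any player's regret over T rounds is trivially bounded in magnitude by kT; the paper's results concern how the optimal regret scales with d, k, and T inside that range.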
