首页> 外文期刊>Neural computation >Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback
【24h】

Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback

机译:具有全强盗反馈的多臂识别多项式算法

获取原文
获取原文并翻译 | 示例

摘要

We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size-ksubset of arms (also known as asuper arm) from givennarms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm or assumed each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe a reward of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of the best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super armsKis exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time isO(logK), thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than 1010super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.
机译:我们研究了随机多臂识别的问题,其中代理顺序地探索了来自鉴定的武器(也称为消泡臂)的尺寸-KSubset,并试图识别最佳的超级臂。到目前为止,大多数工作都考虑了半燃烧设置,其中代理可以观察每个拉动臂的奖励或者在每轮中都能查询每个臂。然而,在现实世界的应用中,昂贵或有时不可能观察个人武器的奖励。在这项研究中,我们解决了全匪设置,其中仅在每个拉动时对超级臂的总和的噪声观察。虽然我们的问题可以被视为线性匪徒的最佳臂识别的实例,但是由于超级枪削幂的数量,基于线性匪徒的基于线性匪徒的天真方法是可计算的。为了应对这个问题,我们首先设计一种在置信椭球最大化中产生的0-1二次编程问题的多项式时间近似算法。基于我们的近似算法,我们提出了一种强盗算法,其计算时间ISO(ogk),从而实现了线性强盗算法的指数加速。我们提供了一个样本复杂性上限,仍然是最糟糕的最佳状态。最后,我们在具有超过1010psuper的大规模数据集上进行实验,术语在计算时间和采样复杂性方面展示了我们算法的优越性。

著录项

  • 来源
    《Neural computation》 |2020年第9期|1733-1773|共41页
  • 作者单位

    Univ Tokyo Bunkyo Ku Tokyo 1130333 Japan|RIKEN Ctr Adv Intelligence Project Chuo Ku Tokyo 1030027 Japan;

    Univ Tokyo Bunkyo Ku Tokyo 1130333 Japan|RIKEN Ctr Adv Intelligence Project Chuo Ku Tokyo 1030027 Japan;

    RIKEN Ctr Adv Intelligence Project Chuo Ku Tokyo 1030027 Japan;

    Univ Tokyo Bunkyo Ku Tokyo 1130333 Japan|RIKEN Ctr Adv Intelligence Project Chuo Ku Tokyo 1030027 Japan;

    Univ Tokyo Bunkyo Ku Tokyo 1130333 Japan|RIKEN Ctr Adv Intelligence Project Chuo Ku Tokyo 1030027 Japan;

  • 收录信息 美国《科学引文索引》(SCI);美国《化学文摘》(CA);
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号