JMLR: Workshop and Conference Proceedings

PAC Battling Bandits in the Plackett-Luce Model



Abstract

We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms, and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under various feedback models: (1) winner of the subset (WI), and (2) ranking of the top-$m$ items (TR) for $2 \le m \le k$. We show, surprisingly, that with winner-information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$ and the same as in the Dueling-Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on the sample complexity of $\Omega\bigg(\frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed by preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (up to logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.
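For intuition, the sketch below simulates the two feedback models from a PL model: an item $i$ in the played subset $S$ wins with probability $\theta_i / \sum_{j \in S} \theta_j$, and a top-$m$ ranking is generated by repeatedly drawing winners without replacement. This is a minimal illustration under assumed names (`theta`, `winner_feedback`, `top_m_feedback`), not code or an algorithm from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def winner_feedback(theta, subset):
    """Winner-information (WI) feedback: the single most preferred item of the subset.

    Under the PL model, item i in subset S wins with probability theta[i] / sum_{j in S} theta[j].
    """
    scores = theta[subset]
    return rng.choice(subset, p=scores / scores.sum())

def top_m_feedback(theta, subset, m):
    """Top-m ranking (TR) feedback: a ranking of the m most preferred items of the subset.

    A PL ranking is drawn by sampling winners sequentially without replacement.
    """
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        scores = theta[remaining]
        pick = rng.choice(remaining, p=scores / scores.sum())
        ranking.append(int(pick))
        remaining.remove(pick)
    return ranking

# Illustrative example: n = 6 arms with assumed PL scores (arm 0 is the best arm);
# the learner plays a subset of k = 4 arms.
theta = np.array([1.0, 0.8, 0.6, 0.5, 0.3, 0.2])
subset = np.array([0, 2, 3, 5])
print(winner_feedback(theta, subset))      # WI feedback: index of the winning arm
print(top_m_feedback(theta, subset, m=3))  # TR feedback: ranking of the top 3 arms in the subset
```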
