JMLR: Workshop and Conference Proceedings

Taking a hint: How to leverage loss predictors in contextual bandits?



Abstract

We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $\mathcal{E} \leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $\mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{\frac{1}{4}}\})$ when $\mathcal{E}$ is known, in contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}\,T^{\frac{1}{3}})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key \emph{action remapping} technique for optimal regret with known $\mathcal{E}$, 2) a computationally efficient implementation of Catoni's robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
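One of the algorithmic ingredients named above is Catoni's robust mean estimator. The paper implements it efficiently via an ERM oracle; independent of that implementation, the estimator itself is defined as the root $\theta$ of $\sum_i \psi(\alpha(x_i - \theta)) = 0$, where $\psi$ is a logarithmically truncated influence function. A minimal standalone sketch (the bisection-based solver and the default choice of `alpha` here are illustrative assumptions, not the paper's oracle-based construction):

```python
import math

def catoni_psi(x: float) -> float:
    # Catoni's influence function: grows like log(|x|) for large |x|
    # instead of linearly, which caps the influence of outliers.
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(samples, alpha=1.0, tol=1e-9):
    """Catoni's robust mean: the root theta of
         sum_i psi(alpha * (x_i - theta)) = 0,
    found by bisection. In theory alpha is tuned from a variance
    bound and a confidence level; alpha=1.0 is a placeholder."""
    lo, hi = min(samples), max(samples)

    def f(theta):
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)

    # f is strictly decreasing in theta, with f(lo) >= 0 >= f(hi),
    # so bisection converges to the unique root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

On data where 99 points sit near 1.0 and a single outlier sits at 1000.0, the arithmetic mean is pulled to about 11, while `catoni_mean` stays close to 1, which is the robustness property the stochastic-setting analysis relies on.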
