JMLR: Workshop and Conference Proceedings

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits



Abstract

We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
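The three estimators compared in the abstract can be sketched on synthetic logged bandit data. Everything below is an illustrative assumption, not the paper's experimental setup: the policies `mu` and `pi`, the deliberately biased reward model `r_hat`, and the switching threshold `tau` are all hypothetical, and contexts are omitted so that policies are fixed distributions over actions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5000, 4

# Hypothetical logging policy mu and target policy pi over K actions.
mu = np.array([0.7, 0.1, 0.1, 0.1])       # logging policy
pi = np.array([0.1, 0.1, 0.1, 0.7])       # target policy
true_r = np.array([0.2, 0.4, 0.6, 0.8])   # true mean reward per action

# Logged data: actions drawn from mu, Bernoulli rewards.
actions = rng.choice(K, size=n, p=mu)
rewards = rng.binomial(1, true_r[actions]).astype(float)

rho = pi[actions] / mu[actions]           # importance weights

# An imperfect (not consistent) reward model, as the abstract allows.
r_hat = true_r + 0.15

# IPS: importance-weighted average of logged rewards.
v_ips = np.mean(rho * rewards)

# DR: model-based baseline plus importance-weighted correction.
v_dr = pi @ r_hat + np.mean(rho * (rewards - r_hat[actions]))

# SWITCH (sketch): use IPS on actions with small importance weight,
# fall back to the reward model on actions whose weight exceeds tau.
tau = 2.0
small = rho <= tau                        # per-sample: weight within threshold
big_actions = (pi / mu) > tau             # per-action: weight exceeds threshold
v_switch = np.mean(rho * rewards * small) + (pi * big_actions) @ r_hat

v_true = pi @ true_r                      # ground-truth value for reference
```

In this toy setup one action has importance weight 7, so IPS pays a large variance price on it; SWITCH caps the weights at `tau` and absorbs the heavy-weight action's value through the (biased) model instead, trading a small bias for a much smaller variance.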
