Statistical Science
Doubly Robust Policy Evaluation and Optimization

Abstract

We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the chosen action by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
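The estimator described in the abstract combines a reward model (the "direct method" term) with an importance-weighted correction based on the logged propensities, so that the estimate remains accurate if either model is good. A minimal sketch of this idea, with illustrative names (`reward_model`, `policy`, and the data layout are assumptions, not the paper's notation):

```python
def doubly_robust_value(contexts, actions, rewards, propensities,
                        reward_model, policy):
    """Doubly robust estimate of the value of `policy` from logged
    contextual-bandit data (a sketch; names are illustrative).

    contexts:     observed contexts x_i
    actions:      logged actions a_i
    rewards:      observed rewards r_i
    propensities: estimated logging probabilities p_hat(a_i | x_i)
    reward_model: callable (x, a) -> estimated reward r_hat(x, a)
    policy:       callable x -> action the new policy would choose
    """
    n = len(rewards)
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = policy(x)
        # Direct-method term: the model's predicted reward for the
        # action the new policy would take in this context.
        dm = reward_model(x, pi_a)
        # Correction term: importance-weighted residual of the reward
        # model, nonzero only when the logged action matches the new
        # policy's action. A good reward model shrinks this residual;
        # good propensities make the correction unbiased.
        correction = (r - reward_model(x, a)) / p if pi_a == a else 0.0
        total += dm + correction
    return total / n
```

With a zero reward model the estimator reduces to inverse-propensity scoring; with an exact reward model the correction vanishes and it reduces to the direct method, which is the sense in which it leverages the strengths of both approaches.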
