Statistical Science
Doubly Robust Policy Evaluation and Optimization

Abstract

We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the chosen action by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
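The estimator described in the abstract combines a reward model (the "direct method" term) with an importance-weighted correction based on the logged propensities, so that the estimate remains accurate if either model is good. A minimal sketch of this idea, with illustrative names (`reward_model`, `policy`, and the data layout are assumptions, not the paper's notation):

```python
def doubly_robust_value(contexts, actions, rewards, propensities,
                        reward_model, policy):
    """Doubly robust estimate of the value of `policy` from logged
    contextual-bandit data (a sketch; names are illustrative).

    contexts:     observed contexts x_i
    actions:      logged actions a_i
    rewards:      observed rewards r_i
    propensities: estimated logging probabilities p_hat(a_i | x_i)
    reward_model: callable (x, a) -> estimated reward r_hat(x, a)
    policy:       callable x -> action the new policy would choose
    """
    n = len(rewards)
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = policy(x)
        # Direct-method term: the model's predicted reward for the
        # action the new policy would take in this context.
        dm = reward_model(x, pi_a)
        # Correction term: importance-weighted residual of the reward
        # model, nonzero only when the logged action matches the new
        # policy's action. A good reward model shrinks this residual;
        # good propensities make the correction unbiased.
        correction = (r - reward_model(x, a)) / p if pi_a == a else 0.0
        total += dm + correction
    return total / n
```

With a zero reward model the estimator reduces to inverse-propensity scoring; with an exact reward model the correction vanishes and it reduces to the direct method, which is the sense in which it leverages the strengths of both approaches.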
