JMLR: Workshop and Conference Proceedings

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits



Abstract

We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
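The three estimators compared in the abstract can be sketched on synthetic logged bandit data. Everything below is an illustrative assumption, not the paper's experimental setup: the policies `mu` and `pi`, the deliberately biased reward model `r_hat`, and the switching threshold `tau` are all hypothetical, and contexts are omitted so that policies are fixed distributions over actions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5000, 4

# Hypothetical logging policy mu and target policy pi over K actions.
mu = np.array([0.7, 0.1, 0.1, 0.1])       # logging policy
pi = np.array([0.1, 0.1, 0.1, 0.7])       # target policy
true_r = np.array([0.2, 0.4, 0.6, 0.8])   # true mean reward per action

# Logged data: actions drawn from mu, Bernoulli rewards.
actions = rng.choice(K, size=n, p=mu)
rewards = rng.binomial(1, true_r[actions]).astype(float)

rho = pi[actions] / mu[actions]           # importance weights

# An imperfect (not consistent) reward model, as the abstract allows.
r_hat = true_r + 0.15

# IPS: importance-weighted average of logged rewards.
v_ips = np.mean(rho * rewards)

# DR: model-based baseline plus importance-weighted correction.
v_dr = pi @ r_hat + np.mean(rho * (rewards - r_hat[actions]))

# SWITCH (sketch): use IPS on actions with small importance weight,
# fall back to the reward model on actions whose weight exceeds tau.
tau = 2.0
small = rho <= tau                        # per-sample: weight within threshold
big_actions = (pi / mu) > tau             # per-action: weight exceeds threshold
v_switch = np.mean(rho * rewards * small) + (pi * big_actions) @ r_hat

v_true = pi @ true_r                      # ground-truth value for reference
```

In this toy setup one action has importance weight 7, so IPS pays a large variance price on it; SWITCH caps the weights at `tau` and absorbs the heavy-weight action's value through the (biased) model instead, trading a small bias for a much smaller variance.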
