Conference on Uncertainty in Artificial Intelligence

Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits

Abstract

We present, and prove properties of, a new offline policy evaluator for the exploration learning setting that is superior to previous evaluators. In particular, it simultaneously and correctly incorporates techniques from importance weighting, doubly robust evaluation, and nonstationary policy evaluation. In addition, our approach allows generating longer histories through careful control of a bias-variance tradeoff, and further decreases variance by incorporating information about the randomness of the target policy. Empirical evidence from synthetic and real-world exploration learning problems shows that the new evaluator successfully unifies previous approaches and uses information an order of magnitude more efficiently.
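For readers unfamiliar with the ingredients named in the abstract, below is a minimal Python sketch of the standard doubly robust (DR) estimator for offline contextual-bandit policy evaluation, one of the techniques the new evaluator combines. It is illustrative only: the function names and interfaces are assumptions, not the paper's actual evaluator, which additionally handles nonstationary policies and history generation.

    # Minimal sketch of the standard doubly robust (DR) off-policy value
    # estimator for contextual bandits. All names and signatures here are
    # illustrative assumptions, not the paper's code.
    def doubly_robust_value(logs, target_policy, reward_model):
        """Estimate the value of target_policy from logged bandit data.

        logs: iterable of (x, a, r, p) tuples, where x is the context,
              a the logged action, r the observed reward, and p > 0 the
              logging policy's probability of choosing a given x.
        target_policy(x) -> action the evaluated policy picks for x.
        reward_model(x, a) -> estimated expected reward r_hat(x, a).
        """
        total, n = 0.0, 0
        for x, a, r, p in logs:
            pi_a = target_policy(x)
            # Direct-method term: model-predicted reward for the
            # target policy's action.
            estimate = reward_model(x, pi_a)
            # Importance-weighted correction of the model's error,
            # applied only when the logged action matches the target
            # policy's choice.
            if pi_a == a:
                estimate += (r - reward_model(x, a)) / p
            total += estimate
            n += 1
        return total / n

The "doubly robust" name reflects the estimator's two fallbacks: if the reward model is accurate, the correction term has low variance, and if the logging probabilities are correct, the estimate remains unbiased even when the model is wrong.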
