Conference on Uncertainty in Artificial Intelligence

Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits

Abstract

We present, and prove properties of, a new offline policy evaluator for the exploration learning setting that is superior to previous evaluators. In particular, it simultaneously and correctly incorporates techniques from importance weighting, doubly robust evaluation, and nonstationary policy evaluation. In addition, our approach allows generating longer histories through careful control of a bias-variance tradeoff, and further decreases variance by incorporating information about the randomness of the target policy. Empirical evidence from synthetic and real-world exploration learning problems shows that the new evaluator successfully unifies previous approaches and uses information an order of magnitude more efficiently.
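For readers unfamiliar with the ingredients named in the abstract, below is a minimal Python sketch of the standard doubly robust (DR) estimator for offline contextual-bandit policy evaluation, one of the techniques the new evaluator combines. It is illustrative only: the function names and interfaces are assumptions, not the paper's actual evaluator, which additionally handles nonstationary policies and history generation.

    # Minimal sketch of the standard doubly robust (DR) off-policy value
    # estimator for contextual bandits. All names and signatures here are
    # illustrative assumptions, not the paper's code.
    def doubly_robust_value(logs, target_policy, reward_model):
        """Estimate the value of target_policy from logged bandit data.

        logs: iterable of (x, a, r, p) tuples, where x is the context,
              a the logged action, r the observed reward, and p > 0 the
              logging policy's probability of choosing a given x.
        target_policy(x) -> action the evaluated policy picks for x.
        reward_model(x, a) -> estimated expected reward r_hat(x, a).
        """
        total, n = 0.0, 0
        for x, a, r, p in logs:
            pi_a = target_policy(x)
            # Direct-method term: model-predicted reward for the
            # target policy's action.
            estimate = reward_model(x, pi_a)
            # Importance-weighted correction of the model's error,
            # applied only when the logged action matches the target
            # policy's choice.
            if pi_a == a:
                estimate += (r - reward_model(x, a)) / p
            total += estimate
            n += 1
        return total / n

The "doubly robust" name reflects the estimator's two fallbacks: if the reward model is accurate, the correction term has low variance, and if the logging probabilities are correct, the estimate remains unbiased even when the model is wrong.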
