Conference on Neural Information Processing Systems

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning



Abstract

Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if the Q-functions are well-specified, but if they are not, they can be worse than both IS and SNIS. DR also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. Along the way, we catalogue various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.
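For context, the three baseline estimators the abstract compares against (IS, SNIS, DR) have standard forms in the contextual-bandit case. The sketch below is an illustrative assumption-laden summary of those baselines only, not the paper's new empirical-likelihood estimator; the array names and the plug-in Q-model are hypothetical.

```python
# Minimal sketch of the three baseline OPE estimators named in the abstract,
# for contextual bandits. Inputs and names are illustrative assumptions.
import numpy as np

def ope_baselines(rho, rewards, q_hat_taken=None, q_hat_target=None):
    """Compute IS, SNIS, and DR estimates of the target policy's value.

    rho          : importance weights pi_e(a_i | x_i) / pi_b(a_i | x_i), shape (n,)
    rewards      : observed rewards r_i, shape (n,)
    q_hat_taken  : Q-model at the logged action, q_hat(x_i, a_i), shape (n,)
    q_hat_target : Q-model averaged over the target policy,
                   sum_a pi_e(a | x_i) * q_hat(x_i, a), shape (n,)
    """
    is_est = np.mean(rho * rewards)                  # plain importance sampling
    snis_est = np.sum(rho * rewards) / np.sum(rho)   # self-normalized IS: weights sum to 1,
                                                     # so the estimate stays within the reward range
    dr_est = None
    if q_hat_taken is not None and q_hat_target is not None:
        # doubly robust: direct-method baseline plus an IS correction of its residual
        dr_est = np.mean(q_hat_target + rho * (rewards - q_hat_taken))
    return is_est, snis_est, dr_est
```

As the abstract notes, DR's advantage over IS and SNIS hinges on the quality of the plugged-in Q-model, and unlike SNIS it is not automatically bounded by the reward range.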
