International Conference on Machine Learning

More Efficient Off-Policy Evaluation through Regularized Targeted Learning



Abstract

We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their O_P(1/√n) rate of convergence and characterizing their asymptotic distribution.
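To make the doubly-robust idea behind the abstract concrete, here is a minimal sketch of a standard doubly-robust (DR) off-policy value estimator for finite-horizon trajectories, in the spirit of the estimator family the paper builds on. It is not the paper's regularized TMLE procedure or its variance reduction techniques; the functions `pi_e`, `pi_b`, `q_hat`, and `v_hat` are hypothetical stand-ins for the evaluation-policy and behavior-policy densities and for previously fitted value models.

```python
# A minimal sketch of a recursive doubly-robust OPE estimator (illustrative only,
# not the paper's regularized TMLE method). Assumes per-step action probabilities
# and fitted Q/V models are supplied by the caller.
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=0.99):
    """Doubly-robust estimate of the evaluation policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
        behavior policies (hypothetical callables).
    q_hat(s, a), v_hat(s): fitted action-value and state-value models.
    """
    estimates = []
    for traj in trajectories:
        dr = 0.0  # DR value of the trajectory tail, built backwards in time
        for (s, a, r) in reversed(traj):
            rho = pi_e(a, s) / pi_b(a, s)  # per-step importance weight
            # Model-based baseline plus importance-weighted correction:
            # the correction vanishes in expectation if q_hat is correct,
            # and the weights are unbiased if pi_b is correct.
            dr = v_hat(s) + rho * (r + gamma * dr - q_hat(s, a))
        estimates.append(dr)
    return float(np.mean(estimates))
```

The estimator is "doubly robust" because it remains consistent if either the value models (`q_hat`, `v_hat`) or the behavior-policy model (`pi_b`) is correctly specified; the paper's contribution is a TMLE-based refitting of the value model, plus regularization, that improves the efficiency of estimators of this form.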


