Conference on Uncertainty in Artificial Intelligence

Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return


Abstract

Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner's discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
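The two-learner construction described in the abstract can be illustrated with a minimal tabular TD(0) sketch. This is not the authors' implementation: the function name `direct_variance_td_step`, the step sizes, the choice to compute both errors before updating either table, and the two-state toy chain are all assumptions made for illustration. The only parts taken from the abstract are that the variance learner uses the squared TD error of the expectation learner as its reward and the squared discount rate as its discount.

```python
import random


def direct_variance_td_step(V, M, s, r, s_next, gamma, alpha_v, alpha_m):
    """One tabular TD(0) update of the value table V and variance table M.

    s_next is None when the transition ends the episode.
    (Hypothetical helper; names and update order are assumptions.)
    """
    v_next = 0.0 if s_next is None else V[s_next]
    m_next = 0.0 if s_next is None else M[s_next]

    # Expectation learner: ordinary TD error.
    delta = r + gamma * v_next - V[s]

    # Variance learner: reward is delta^2, discount is gamma^2.
    delta_bar = delta ** 2 + (gamma ** 2) * m_next - M[s]

    V[s] += alpha_v * delta
    M[s] += alpha_m * delta_bar
    return delta, delta_bar


# Hypothetical toy chain: state 0 -> state 1 -> terminal, with a reward of
# +1 or -1 (equal probability) on every step.
rng = random.Random(0)
gamma = 0.9
V = [0.0, 0.0]
M = [0.0, 0.0]
for _ in range(20000):
    for s, s_next in ((0, 1), (1, None)):
        r = rng.choice((-1.0, 1.0))
        direct_variance_td_step(V, M, s, r, s_next, gamma, 0.01, 0.01)

# Analytic return variances for this chain: 1 + gamma**2 = 1.81 in state 0
# and 1.0 in state 1; the learned M should land near these values.
print("V:", V)
print("M:", M)
```

In this toy chain the rewards are independent across steps, so the variance of the return from state 0 is the sum of the per-step reward variance and the squared-discounted variance from state 1, which is the quantity the second learner's bootstrapped target approximates.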
