
To Discount or not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning



Abstract

Most work in reinforcement learning (RL) is based on discounted techniques, such as Q learning, where long-term rewards are geometrically attenuated based on the delay in their occurrence. Schwartz recently proposed an undiscounted RL technique called R learning that optimizes average reward, and argued that it was a better metric than the discounted one optimized by Q learning. In this paper we compare R learning with Q learning on a simulated robot box-pushing task. We compare these two techniques across three different exploration strategies: two of them undirected, Boltzmann and semi-uniform, and one recency-based directed strategy. Our results show that Q learning performs better than R learning, even when both are evaluated using the same undiscounted performance measure. Furthermore, R learning appears to be very sensitive to the choice of exploration strategy. In particular, a surprising result is that R learning's performance noticeably deteriorates under Boltzmann exploration. We identify precisely a limit cycle situation that causes R learning's performance to deteriorate when combined with Boltzmann exploration, and show where such limit cycles arise in our robot task. However, R learning performs much better (although not as well as Q learning) when combined with semi-uniform and recency-based exploration. In this paper, we also argue for using medians over means as a better distribution-free estimator of average performance, and describe a simple non-parametric significance test for comparing learning data from two RL techniques.
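As context for the comparison described in the abstract, the following is a minimal tabular sketch (not taken from the paper) of the two update rules and of Boltzmann exploration. The function names, step sizes (alpha, beta, gamma, alpha_rho), the temperature value, and the convention of adjusting the average-reward estimate rho only on greedy steps are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(values, temperature=0.5):
    """Boltzmann (softmax) exploration: pick action a with probability
    proportional to exp(values[a] / temperature)."""
    prefs = np.exp((values - values.max()) / temperature)  # shift for numerical stability
    return rng.choice(len(values), p=prefs / prefs.sum())

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Discounted Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def r_learning_update(R, rho, s, a, r, s_next, beta=0.1, alpha_rho=0.01):
    """Undiscounted (average-reward) R-learning update in the style of Schwartz:
    R(s,a) <- R(s,a) + beta * [r - rho + max_a' R(s',a') - R(s,a)],
    with rho adjusted only when the executed action was greedy."""
    was_greedy = R[s, a] == R[s].max()
    R[s, a] += beta * (r - rho + R[s_next].max() - R[s, a])
    if was_greedy:
        rho += alpha_rho * (r + R[s_next].max() - R[s].max() - rho)
    return rho
```

A typical usage would be to hold Q (or R) as a NumPy array of shape (n_states, n_actions) and select each action with `a = boltzmann_action(Q[s], temperature=T)`, lowering T over time to shift from exploration to exploitation.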

Bibliographic Details

  • Source: Machine Learning | 1994 | pp. 164-172 | 9 pages
  • Conference location: New Brunswick, NJ (US)
  • Author: Sridhar Mahadevan
  • Author affiliation: Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620
  • Format: PDF
  • Language: English
  • Classification: Computer applications

