To Discount or not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning

Abstract

Most work in reinforcement learning (RL) is based on discounted techniques, such as Q learning, where long-term rewards are geometrically attenuated based on the delay in their occurrence. Schwartz recently proposed an undiscounted RL technique called R learning that optimizes average reward, and argued that it was a better metric than the discounted one optimized by Q learning. In this paper we compare R learning with Q learning on a simulated robot box-pushing task. We compare these two techniques across three different exploration strategies: two of them undirected, Boltzmann and semi-uniform, and one recency-based directed strategy. Our results show that Q learning performs better than R learning, even when both are evaluated using the same undiscounted performance measure. Furthermore, R learning appears to be very sensitive to choice of exploration strategy. In particular, a surprising result is that R learning's performance noticeably deteriorates under Boltzmann exploration. We identify precisely a limit cycle situation that causes R learning's performance to deteriorate when combined with Boltzmann exploration, and show where such limit cycles arise in our robot task. However, R learning performs much better (although not as well as Q learning) when combined with semi-uniform and recency-based exploration. In this paper, we also argue for using medians over means as a better distribution-free estimator of average performance, and describe a simple non-parametric significance test for comparing learning data from two RL techniques.
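The abstract names the two learning rules and two of the exploration strategies without stating them. As a rough illustration only, the sketch below uses the commonly cited tabular forms of discounted Q learning and Schwartz-style R learning, together with Boltzmann and semi-uniform action probabilities. The function names, the step sizes `beta` and `alpha`, the discount `gamma`, the temperature, and `p_best` are assumptions made for this example and are not taken from the paper; the paper's exact formulation may differ.

```python
import math
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
    """One discounted Q-learning step: future value is attenuated by gamma."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += beta * (target - Q[(s, a)])

def r_update(R, rho, s, a, r, s_next, actions, beta=0.1, alpha=0.01):
    """One undiscounted R-learning step (assumed standard form).

    rho is the running estimate of average reward per step; it is only
    adjusted when the action taken was greedy. Returns the new rho.
    """
    best_here = max(R[(s, b)] for b in actions)
    best_next = max(R[(s_next, b)] for b in actions)
    was_greedy = R[(s, a)] == best_here
    R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])
    if was_greedy:
        rho += alpha * (r - rho + best_next - best_here)
    return rho

def boltzmann_probs(values, temperature=1.0):
    """Action probabilities proportional to exp(value / temperature)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def semi_uniform_probs(values, p_best=0.9):
    """With probability p_best take the best action, else pick uniformly."""
    n = len(values)
    best = max(range(n), key=lambda i: values[i])
    probs = [(1.0 - p_best) / n] * n
    probs[best] += p_best
    return probs

# Usage with hypothetical states 0 and 1 and actions 0 and 1.
Q = defaultdict(float)
q_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
R = defaultdict(float)
rho = r_update(R, 0.0, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

The structural difference is visible in the targets: `q_update` scales the next-state value by `gamma`, while `r_update` subtracts the average-reward estimate `rho` and applies no discount.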

Bibliographic details

Source
Machine Learning | 1994 | pp. 164-172 | 9 pages
Conference location: New Brunswick, NJ (US)
Author
Sridhar Mahadevan
Affiliation

Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620

Original format: PDF
Language: English
Chinese Library Classification: Computer applications
