Sensitive Discount Optimality: Unifying Discounted and Average Reward Reinforcement Learning

Abstract

Thus far, research in reinforcement learning (RL) has concentrated on two optimality criteria: the discounted framework, which has been very well studied, and the average-reward framework, in which interest is rapidly increasing. This paper presents a framework called sensitive discount optimality, which offers an elegant way of linking these two paradigms. Although sensitive discount optimality has been well studied in dynamic programming, with several provably convergent algorithms, it has not received any attention in RL. This framework is based on studying the properties of the expected cumulative discounted reward as the discount factor tends to 1. Under these conditions, the cumulative discounted reward can be expanded in a Laurent series to yield a sequence of terms, the first of which is the average reward, the second of which involves the average-adjusted sum of rewards (or bias), and so on. We use the sensitive discount optimality framework to derive a new model-free average-reward technique, which is related to the Q-learning-type methods proposed by Bertsekas, Schwartz, and Singh, but which, unlike these previous methods, optimizes both the first and second terms in the Laurent series (the average reward and bias values).
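The expansion mentioned in the abstract can be checked numerically on a small example. For a fixed policy with average reward g and bias h(s), the first two Laurent terms give V_gamma(s) = g / (1 - gamma) + h(s) + e(s, gamma), where e vanishes as gamma tends to 1. The sketch below is only an illustration under stated assumptions: the two-state chain, its rewards, the symbols g, h, and e, and all function names are hypothetical and not taken from the paper, and it evaluates one fixed policy rather than implementing the paper's model-free algorithm.

```python
"""Check V_gamma(s) ~ g / (1 - gamma) + h(s) on a toy two-state Markov chain.

The chain, rewards, and helper names are illustrative assumptions,
not taken from the paper.
"""

# Hypothetical fixed-policy chain: P[s][t] is the probability of moving s -> t.
P = [[0.9, 0.1],
     [0.2, 0.8]]
R = [1.0, 0.0]  # hypothetical expected reward received in each state


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def discounted_value(gamma):
    """Discounted value V_gamma, the solution of (I - gamma * P) V = R."""
    a = [[1.0 - gamma * P[0][0], -gamma * P[0][1]],
         [-gamma * P[1][0], 1.0 - gamma * P[1][1]]]
    return solve_2x2(a, R)


def gain_and_bias():
    """Average reward g and bias h of the chain."""
    # Closed-form stationary distribution of a two-state chain.
    total = P[0][1] + P[1][0]
    pi = [P[1][0] / total, P[0][1] / total]
    g = pi[0] * R[0] + pi[1] * R[1]
    # The bias solves h = R - g + P h, pinned down by the condition pi . h = 0.
    a = [[1.0 - P[0][0], -P[0][1]],
         [pi[0], pi[1]]]
    h = solve_2x2(a, [R[0] - g, 0.0])
    return g, h


def truncation_error(gamma):
    """Largest gap between V_gamma and the two-term expansion g/(1-gamma) + h."""
    g, h = gain_and_bias()
    v = discounted_value(gamma)
    return max(abs(v[s] - (g / (1.0 - gamma) + h[s])) for s in range(2))
```

Calling `truncation_error` with discount factors closer to 1 (for example 0.9, then 0.999) should show the gap shrinking, which is the sense in which the average reward and the bias are the leading terms of the discounted value.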

Bibliographic details

Source: Machine learning | 1996 | pp. 328-336 | 9 pages
Conference location: Bari (IT)
Author: Sridhar Mahadevan
Affiliation: Department of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620
Original format: PDF
Language: English (eng)
Chinese Library Classification: Computer applications
