International Conference on Logic for Programming, Artificial Intelligence, and Reasoning

Multi-objective Discounted Reward Verification in Graphs and MDPs

Abstract

We study the problem of achieving a given value in Markov decision processes (MDPs) with several independent discounted reward objectives. We consider a generalised version of discounted reward objectives, in which the amount of discounting depends on the states visited and on the objective. This definition extends the usual definition of discounted reward and allows us to capture systems in which the values of different commodities diminish at different and variable rates. We establish results for two prominent subclasses of the problem, namely state-discount models, where the discount factors depend only on the state of the MDP (and are independent of the objective), and reward-discount models, where they depend only on the objective (but not on the state of the MDP). For state-discount models we use a straightforward reduction to expected total reward and show that the problem of whether a value is achievable can be solved in polynomial time. For reward-discount models we show that memory and randomisation of the strategies are required, but that the problem is nevertheless decidable, and that it is sufficient to consider strategies which after a certain number of steps behave in a memoryless way. For the general case, we show that when restricted to graphs (i.e. MDPs with no randomisation), pure strategies and discount factors of the form 1/n where n is an integer, the problem is in PSPACE and finite memory suffices for achieving a given value. We also show that when the discount factors are not of the form 1/n, the memory required by a strategy can be infinite.
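
To make the generalised objective concrete, here is a minimal Python sketch (illustrative only; the function and variable names are not from the paper) that evaluates the vector of discounted rewards accrued along a finite path prefix, with a discount factor that may depend on both the current state and the objective. Since all factors lie strictly between 0 and 1, longer prefixes approximate the value of an infinite run.

```python
def discounted_rewards(path, rewards, discounts):
    """Vector of generalised discounted rewards of a finite path prefix.

    path      -- sequence of states s_0, s_1, ...
    rewards   -- rewards[i][s]: reward of state s under objective i
    discounts -- discounts[i][s]: factor applied to objective i after
                 leaving state s (all assumed strictly in (0, 1))
    """
    k = len(rewards)                  # number of objectives
    totals = [0.0] * k
    factor = [1.0] * k                # accumulated discount per objective
    for s in path:
        for i in range(k):
            totals[i] += factor[i] * rewards[i][s]
            factor[i] *= discounts[i][s]
    return totals

# Two states (0 and 1) and two objectives with different discount rates.
rewards   = [{0: 1.0, 1: 0.0}, {0: 0.0, 1: 2.0}]
discounts = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.9}]
print(discounted_rewards([0, 1, 0, 1], rewards, discounts))
```

The two subclasses studied in the paper correspond to restrictions on the discounts table: in a state-discount model, discounts[i][s] is the same for every objective i, while in a reward-discount model it is the same for every state s.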
