AAAI Conference on Artificial Intelligence

State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning



Abstract

In the framework of MDPs, although the general reward function takes three arguments (current state, action, and successor state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective involves the expected total reward only, this simplification works perfectly. However, when the objective is risk-sensitive, this simplification leads to an incorrect value. We propose three successively more general state-augmentation transformations (SATs), which preserve the reward sequences as well as the reward distributions and the optimal policy in risk-sensitive reinforcement learning. In risk-sensitive scenarios, we first prove that, for every MDP with a stochastic transition-based reward function, there exists an MDP with a deterministic state-based reward function, such that for any given (randomized) policy for the first MDP, there exists a corresponding policy for the second MDP under which both Markov reward processes share the same reward sequence. Second, we illustrate, in an inventory control problem, two situations that require the proposed SATs: one is applying Q-learning (or other learning methods) to MDPs with transition-based reward functions; the other is applying methods designed for Markov processes with deterministic state-based reward functions to Markov processes with general reward functions. We show the advantage of the SATs by considering Value-at-Risk as an example, a risk measure defined on the reward distribution itself rather than on summary statistics of the distribution (such as mean and variance). We illustrate the error that the reward simplification introduces into the estimated reward distribution, and show how the SATs enable a variance formula to work on Markov processes with general reward functions.
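The core idea behind the abstract can be sketched in a few lines of code. The toy example below is a hypothetical illustration, not the paper's construction: the two-state MDP, the `step_augmented` helper, and the `rollout` function are invented names, and appending the last transition to the state is only a minimal instance of the state-augmentation idea rather than the paper's three SATs. It builds an MDP whose reward depends on the full transition (s, a, s'), augments the state so that the same reward becomes a deterministic function of the augmented state alone, and estimates an empirical Value-at-Risk on the resulting total-reward distribution.

```python
import random

random.seed(0)

# Toy two-state MDP (illustrative, not from the paper):
# P[s][a] is a list of (next_state, probability) pairs.
P = {
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.2), (1, 0.8)]},
}

def reward(s, a, s_next):
    # Transition-based reward: it depends on the successor state, so replacing it
    # by its expectation given (s, a) would change the reward distribution.
    return 1.0 if s_next != s else -0.5

def step(s, a):
    """One step of the original MDP: sample s' ~ P(.|s, a) and return (s', r)."""
    next_states = [sn for sn, _ in P[s][a]]
    probs = [p for _, p in P[s][a]]
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, reward(s, a, s_next)

def step_augmented(x, a):
    """One step of the augmented MDP. The augmented state x = (s_prev, a_prev, s)
    records the last transition, so the reward is a deterministic function of the
    current augmented state alone, i.e. a state-based reward."""
    _, _, s = x
    s_next, _ = step(s, a)
    x_next = (s, a, s_next)
    return x_next, reward(*x_next)

def rollout(horizon=20, policy=lambda s: random.choice([0, 1])):
    """Total reward of one episode of the augmented MDP under a (randomized) policy."""
    x, total = (None, None, 0), 0.0  # start in state 0 with a dummy previous transition
    for _ in range(horizon):
        a = policy(x[2])
        x, r = step_augmented(x, a)
        total += r
    return total

# Empirical Value-at-Risk at level alpha: a lower quantile of the total-reward
# distribution, which mean/variance summaries alone cannot capture.
returns = sorted(rollout() for _ in range(5000))
alpha = 0.05
print(f"Empirical {alpha:.0%} VaR of the total reward: {returns[int(alpha * len(returns))]:.2f}")
```

Replacing reward(s, a, s') by its conditional expectation given (s, a) would leave the mean of the total reward unchanged but distort its distribution, which is exactly the kind of error the abstract attributes to the reward simplification when quantile-based measures such as Value-at-Risk are of interest.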
