
Two stochastic dynamic programming problems by model-free actor-critic recurrent-network learning in non-Markovian settings

Abstract

We describe two stochastic non-Markovian dynamic programming (DP) problems and show how they can be attacked with actor-critic reinforcement learning using recurrent neural networks (RNNs). We assume that the current state of the dynamical system is "completely observable", but that the rules governing the current reward and state transition, which are unknown to our decision-making agent, depend not only on the current state and action but possibly on the "entire history" of past states and actions. This should not be confused with "partially observable Markov decision processes (POMDPs)", in which the current state must be inferred from partial (observable) state information or from error-corrupted observations. Our actor-critic RNN agent is capable of finding an optimal policy without learning the transition probabilities, the associated rewards, or the extent to which the current state space must be augmented for the Markov property to hold. The RNN's recurrent connections, or context units, act as an "implicit" history memory (or internal state) that develops "sensitivity" to non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a "totally model-free" fashion. In particular, using two small-scale longest-path problems in a stochastic non-Markovian setting, we discuss the features of model-free learning in comparison with the model-based approach of the classical DP algorithm.
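The abstract gives no implementation details, so the sketch below is only a loose illustration of the general idea rather than the authors' method: an actor-critic agent whose recurrent hidden state serves as an implicit history memory, updated on-line from observed states and rewards alone, with no transition or reward model. All names, the GRU core, the one-hot encoding, and the hyperparameters are assumptions made for this example.

```python
# Minimal actor-critic RNN sketch (PyTorch); hypothetical, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticRNN(nn.Module):
    """Actor-critic agent with a recurrent core: the GRU hidden state acts as
    an implicit memory of the state/action history, so no explicit state
    augmentation and no model of transitions or rewards is required."""
    def __init__(self, n_states, n_actions, hidden_size=32):
        super().__init__()
        self.n_states, self.n_actions, self.hidden_size = n_states, n_actions, hidden_size
        self.rnn = nn.GRUCell(n_states + n_actions, hidden_size)
        self.actor = nn.Linear(hidden_size, n_actions)   # policy head
        self.critic = nn.Linear(hidden_size, 1)          # state-value head

    def forward(self, state, prev_action, h):
        s = F.one_hot(state, self.n_states).float()
        a = F.one_hot(prev_action, self.n_actions).float()
        h = self.rnn(torch.cat([s, a], dim=-1).unsqueeze(0), h)  # update implicit history memory
        return F.softmax(self.actor(h), dim=-1), self.critic(h), h

def run_episode(agent, env, optimizer, gamma=1.0):
    """One episode of on-line actor-critic (TD(0)) updates.  `env` is assumed to
    expose reset() -> state and step(action) -> (state, reward, done); its
    rewards and transitions may depend on the whole history, which the agent
    never models explicitly."""
    state, prev_action, done = env.reset(), 0, False
    h = torch.zeros(1, agent.hidden_size)
    while not done:
        policy, value, h = agent(torch.tensor(state), torch.tensor(prev_action), h)
        action = torch.multinomial(policy.squeeze(0), 1).item()
        next_state, reward, done = env.step(action)

        with torch.no_grad():                            # bootstrap target for the critic
            if done:
                next_value = torch.zeros_like(value)
            else:
                _, next_value, _ = agent(torch.tensor(next_state), torch.tensor(action), h)

        td_error = reward + gamma * next_value - value   # model-free TD(0) error
        critic_loss = td_error.pow(2).mean()
        actor_loss = -td_error.detach().squeeze() * torch.log(policy.squeeze(0)[action])
        optimizer.zero_grad()
        (critic_loss + actor_loss).backward()
        optimizer.step()

        h = h.detach()                                   # truncate backprop through time at each step
        state, prev_action = next_state, action
```

A usage example would instantiate `ActorCriticRNN` with the problem's state and action counts (e.g. for a small longest-path graph), build `torch.optim.Adam(agent.parameters())`, and call `run_episode` repeatedly; whether the hidden state captures enough of the history to make the process effectively Markovian is exactly the question the paper's experiments examine.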
