
Two stochastic dynamic programming problems by model-free actor-critic recurrent-network learning in non-Markovian settings

Abstract

We describe two stochastic non-Markovian dynamic programming (DP) problems and show how they can be attacked with actor-critic reinforcement learning using recurrent neural networks (RNNs). We assume that the current state of the dynamical system is "completely observable", but that the rules governing the current reward and state transition, which are unknown to our decision-making agent, depend not only on the current state and action but possibly on the "entire history" of past states and actions. This should not be confused with "partially observable Markov decision processes (POMDPs)", in which the current state must be inferred from partial (observable) state information or from error-corrupted observations. Our actor-critic RNN agent is capable of finding an optimal policy without learning the transition probabilities, the associated rewards, or the extent to which the current state space must be augmented for the Markov property to hold. The RNN's recurrent connections, or context units, act as an "implicit" history memory (or internal state) that develops "sensitivity" to non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a "totally model-free" fashion. In particular, using two small-scale longest-path problems in a stochastic non-Markovian setting, we discuss the features of model-free learning in comparison with the model-based approach of the classical DP algorithm.
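The abstract gives no implementation details, so the sketch below is only a loose illustration of the general idea rather than the authors' method: an actor-critic agent whose recurrent hidden state serves as an implicit history memory, updated on-line from observed states and rewards alone, with no transition or reward model. All names, the GRU core, the one-hot encoding, and the hyperparameters are assumptions made for this example.

```python
# Minimal actor-critic RNN sketch (PyTorch); hypothetical, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticRNN(nn.Module):
    """Actor-critic agent with a recurrent core: the GRU hidden state acts as
    an implicit memory of the state/action history, so no explicit state
    augmentation and no model of transitions or rewards is required."""
    def __init__(self, n_states, n_actions, hidden_size=32):
        super().__init__()
        self.n_states, self.n_actions, self.hidden_size = n_states, n_actions, hidden_size
        self.rnn = nn.GRUCell(n_states + n_actions, hidden_size)
        self.actor = nn.Linear(hidden_size, n_actions)   # policy head
        self.critic = nn.Linear(hidden_size, 1)          # state-value head

    def forward(self, state, prev_action, h):
        s = F.one_hot(state, self.n_states).float()
        a = F.one_hot(prev_action, self.n_actions).float()
        h = self.rnn(torch.cat([s, a], dim=-1).unsqueeze(0), h)  # update implicit history memory
        return F.softmax(self.actor(h), dim=-1), self.critic(h), h

def run_episode(agent, env, optimizer, gamma=1.0):
    """One episode of on-line actor-critic (TD(0)) updates.  `env` is assumed to
    expose reset() -> state and step(action) -> (state, reward, done); its
    rewards and transitions may depend on the whole history, which the agent
    never models explicitly."""
    state, prev_action, done = env.reset(), 0, False
    h = torch.zeros(1, agent.hidden_size)
    while not done:
        policy, value, h = agent(torch.tensor(state), torch.tensor(prev_action), h)
        action = torch.multinomial(policy.squeeze(0), 1).item()
        next_state, reward, done = env.step(action)

        with torch.no_grad():                            # bootstrap target for the critic
            if done:
                next_value = torch.zeros_like(value)
            else:
                _, next_value, _ = agent(torch.tensor(next_state), torch.tensor(action), h)

        td_error = reward + gamma * next_value - value   # model-free TD(0) error
        critic_loss = td_error.pow(2).mean()
        actor_loss = -td_error.detach().squeeze() * torch.log(policy.squeeze(0)[action])
        optimizer.zero_grad()
        (critic_loss + actor_loss).backward()
        optimizer.step()

        h = h.detach()                                   # truncate backprop through time at each step
        state, prev_action = next_state, action
```

A usage example would instantiate `ActorCriticRNN` with the problem's state and action counts (e.g. for a small longest-path graph), build `torch.optim.Adam(agent.parameters())`, and call `run_episode` repeatedly; whether the hidden state captures enough of the history to make the process effectively Markovian is exactly the question the paper's experiments examine.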
