Annals of Operations Research

Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains



Abstract

To solve a sequential decision-making problem in a non-Markovian domain, standard dynamic programming (DP) requires a complete mathematical model; it is therefore a totally model-based approach. By contrast, this paper describes a totally model-free approach based on actor-critic reinforcement learning with recurrent neural networks. The recurrent connections (or context units) in the networks act as an implicit form of internal state (i.e., history memory) that develops sensitivity to hidden non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a totally model-free fashion. That is, the model-free recurrent-network agent learns neither the transition probabilities and associated rewards, nor how much the state space should be enlarged so that the Markov property holds. For concreteness, we illustrate time-lagged path problems, in which the learning agent is expected to learn a best (history-dependent) policy that maximizes the total return: the sum of one-step transition rewards plus special "bonus" values that depend on prior transitions or decisions. Since an optimal solution can be obtained by model-based DP, this provides an excellent test for understanding the agent's model-free learning behavior. Such actor-critic recurrent-network learning might constitute a mechanism that animal brains use when experientially acquiring skilled action. Given a concrete non-Markovian problem example, the goal of this paper is to show the conceptual merit of totally model-free learning with actor-critic recurrent networks, compared with classical DP (and other model-building procedures), rather than to pursue a best recurrent-network learning strategy.
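To make the mechanism concrete, the sketch below shows an actor and a critic sharing an Elman-style recurrent core, so that the hidden state plays the role of the implicit history memory described in the abstract. This is a minimal illustrative sketch in Python/PyTorch, not the authors' implementation; the environment interface (env.reset / env.step returning observation, reward, done), the network sizes, and the learning-rate and discount settings are assumptions made only for the example.

```python
# Minimal sketch (illustrative only, not the paper's code): an actor-critic agent
# whose policy and value heads share a recurrent core; the hidden state carries
# the history information needed in a non-Markovian (e.g. time-lagged path) task.
import torch
import torch.nn as nn


class RecurrentActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=32):
        super().__init__()
        self.core = nn.RNNCell(obs_dim, hidden_dim)    # context units = implicit history memory
        self.actor = nn.Linear(hidden_dim, n_actions)  # policy head (actor)
        self.critic = nn.Linear(hidden_dim, 1)         # state-value head (critic)

    def forward(self, obs, h):
        h = self.core(obs, h)                          # fold the new observation into the hidden state
        return self.actor(h), self.critic(h), h


def run_episode(env, model, optimizer, gamma=0.99):
    """One on-policy episode; gradients flow back through time over the trajectory."""
    obs = env.reset()                                  # assumed interface: reset() -> observation
    h = torch.zeros(1, model.core.hidden_size)
    log_probs, values, rewards, done = [], [], [], False

    while not done:
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        logits, value, h = model(x, h)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, done = env.step(action.item())    # reward may include a history-dependent "bonus"
        log_probs.append(dist.log_prob(action).squeeze())
        values.append(value.squeeze())
        rewards.append(float(reward))

    # Discounted return for each step, computed backwards from the episode end.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    values = torch.stack(values)
    advantage = returns - values                       # model-free learning signal for both heads

    actor_loss = -(torch.stack(log_probs) * advantage.detach()).sum()
    critic_loss = advantage.pow(2).sum()
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
    return sum(rewards)                                # total (undiscounted) return of the episode
```

With a hypothetical environment exposing that interface, training would simply repeat run_episode using, e.g., model = RecurrentActorCritic(obs_dim, n_actions) and optimizer = torch.optim.Adam(model.parameters(), lr=1e-3). The point of the sketch is that nothing in it estimates transition probabilities or constructs an enlarged Markovian state space; the recurrent hidden state alone is left to absorb the non-Markovian dependencies.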
