IEEE Transactions on Neural Networks and Learning Systems

An Improved N-Step Value Gradient Learning Adaptive Dynamic Programming Algorithm for Online Learning



Abstract

In problems with complex dynamics and challenging state spaces, the dual heuristic programming (DHP) algorithm has been shown theoretically and experimentally to perform well. It was recently extended by an approach called value gradient learning (VGL). VGL was inspired by a version of temporal difference (TD) learning that uses eligibility traces. The eligibility traces apply an exponential decay to older observations, governed by a decay parameter (lambda). This approach is known as TD(lambda), and its DHP extension is known as VGL(lambda), where VGL(0) is identical to DHP. VGL exhibits convergence and other desirable properties, but it is primarily suited to batch learning. Online learning requires an eligibility-trace-workspace matrix, which the batch-learning version of VGL does not need. Since online learning is desirable for many applications, it is important to remove this computational and memory impediment. This paper introduces a dual-critic version of VGL, called N-step VGL (NSVGL), that does not need the eligibility-trace-workspace matrix, thereby allowing online learning. Furthermore, this combination of critic networks allows an NSVGL algorithm to learn faster. The first critic is similar to DHP and is adapted based on TD(0) learning, while the second critic is adapted based on a gradient of n-step TD(lambda) learning. Both networks are combined to train an actor network. The combination of feedback signals from both critic networks provides an optimal decision faster than traditional adaptive dynamic programming (ADP) by mixing current information with event history. Convergence proofs are provided: gradients of the one-step and n-step value functions are monotonically nondecreasing and converge to the optimum. Two simulation case studies are presented for NSVGL to show its superior performance.
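The two building blocks the abstract contrasts can be made concrete with a generic tabular sketch (not the paper's neural-network NSVGL algorithm): TD(lambda) maintains an eligibility-trace vector that decays by gamma*lambda each step, which is the per-state analogue of the trace workspace that batch VGL avoids, while an n-step TD target replaces the trace with a fixed-length lookahead plus a bootstrapped tail value. The function names and the tabular setting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def td_lambda_update(V, trajectory, alpha=0.1, gamma=0.99, lam=0.8):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    V: 1-D value table; trajectory: list of (state, reward, next_state).
    The trace vector e decays by gamma*lam each step, so older states
    receive exponentially smaller shares of each new TD error.
    """
    e = np.zeros_like(V)
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]  # TD(0) error at this step
        e *= gamma * lam                      # exponential decay of old traces
        e[s] += 1.0                           # accumulate trace for current state
        V += alpha * delta * e                # credit all recently visited states
    return V

def n_step_return(rewards, V, s_n, gamma=0.99):
    """n-step TD target: discounted reward sum plus bootstrapped tail value V[s_n]."""
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    return G + gamma**len(rewards) * V[s_n]
```

The trace vector `e` is exactly the per-step memory that online learning must carry; the n-step target needs only the last n rewards, which is the memory saving NSVGL exploits.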
