IEEE Transactions on Neural Networks and Learning Systems

An Improved N-Step Value Gradient Learning Adaptive Dynamic Programming Algorithm for Online Learning



Abstract

In problems with complex dynamics and challenging state spaces, the dual heuristic programming (DHP) algorithm has been shown theoretically and experimentally to perform well. It was recently extended by an approach called value gradient learning (VGL). VGL was inspired by a version of temporal difference (TD) learning that uses eligibility traces. The eligibility traces apply an exponential decay to older observations, governed by a decay parameter (lambda). This approach is known as TD(lambda), and its DHP extension is known as VGL(lambda), where VGL(0) is identical to DHP. VGL exhibits convergence and other desirable properties, but it is primarily suited to batch learning. Online learning requires an eligibility-trace-workspace matrix, which the batch-learning version of VGL does not need. Since online learning is desirable for many applications, it is important to remove this computational and memory impediment. This paper introduces a dual-critic version of VGL, called N-step VGL (NSVGL), that does not need the eligibility-trace-workspace matrix, thereby allowing online learning. Furthermore, this combination of critic networks allows an NSVGL algorithm to learn faster. The first critic is similar to DHP and is adapted based on TD(0) learning, while the second critic is adapted based on a gradient of n-step TD(lambda) learning. Both networks are combined to train an actor network. The combination of feedback signals from both critic networks provides an optimal decision faster than traditional adaptive dynamic programming (ADP) by mixing current information with event history. Convergence proofs are provided: gradients of the one-step and n-step value functions are monotonically nondecreasing and converge to the optimum. Two simulation case studies are presented for NSVGL to show its superior performance.
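The two building blocks the abstract contrasts can be made concrete with a generic tabular sketch (not the paper's neural-network NSVGL algorithm): TD(lambda) maintains an eligibility-trace vector that decays by gamma*lambda each step, which is the per-state analogue of the trace workspace that batch VGL avoids, while an n-step TD target replaces the trace with a fixed-length lookahead plus a bootstrapped tail value. The function names and the tabular setting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def td_lambda_update(V, trajectory, alpha=0.1, gamma=0.99, lam=0.8):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    V: 1-D value table; trajectory: list of (state, reward, next_state).
    The trace vector e decays by gamma*lam each step, so older states
    receive exponentially smaller shares of each new TD error.
    """
    e = np.zeros_like(V)
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]  # TD(0) error at this step
        e *= gamma * lam                      # exponential decay of old traces
        e[s] += 1.0                           # accumulate trace for current state
        V += alpha * delta * e                # credit all recently visited states
    return V

def n_step_return(rewards, V, s_n, gamma=0.99):
    """n-step TD target: discounted reward sum plus bootstrapped tail value V[s_n]."""
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    return G + gamma**len(rewards) * V[s_n]
```

The trace vector `e` is exactly the per-step memory that online learning must carry; the n-step target needs only the last n rewards, which is the memory saving NSVGL exploits.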
