Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods

Abstract

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making problem, e.g., a reinforcement learning (RL) or optimal control problem, and has served as the foundation for developing RL methods. Motivated by integral PI (IPI) schemes in optimal control and by RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the general RL problem in CTS, with the environment modeled by an ordinary differential equation (ODE). In this continuous domain, we also propose four off-policy IPI methods: two are the ideal PI forms that use advantage and Q-functions, respectively, and the other two are natural extensions of existing off-policy IPI schemes to our general RL framework. Compared to the IPI methods in optimal control, the proposed IPI schemes apply to more general situations and do not require an initial stabilizing policy to run; they are also closely related to RL algorithms in CTS such as advantage updating, Q-learning, and value-gradient based (VGB) greedy policy improvement. Our on-policy IPI is basically model-based but can be made partially model-free; each off-policy method is also either partially or completely model-free. The mathematical properties of the IPI methods (admissibility, monotone improvement, and convergence towards the optimal solution) are all rigorously proven, together with the equivalence of on- and off-policy IPI. Finally, the IPI methods are simulated with an inverted-pendulum model to support the theory and verify the performance.
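For orientation, a minimal sketch of the PI recursion in continuous time is given below, written for an assumed ODE environment \(\dot{x} = f(x,u)\), reward \(R(x,u)\), discount rate \(\beta > 0\), and evaluation interval \(T > 0\); these symbols and the reward-maximization convention are illustrative, and the paper's exact IPI formulation may differ in detail.

\[
V_i(x(t)) = \int_t^{t+T} e^{-\beta (s-t)}\, R\bigl(x(s), \pi_i(x(s))\bigr)\, ds \;+\; e^{-\beta T}\, V_i(x(t+T))
\quad \text{(policy evaluation along a trajectory under } \pi_i\text{)}
\]

\[
\pi_{i+1}(x) = \arg\max_{u} \Bigl[\, R(x,u) + \nabla V_i(x)^{\top} f(x,u) \,\Bigr]
\quad \text{(value-gradient based greedy policy improvement)}
\]

The integral evaluation equation only requires trajectory data generated under \(\pi_i\), which is what permits the partially model-free variants mentioned in the abstract; the VGB improvement step still uses \(\nabla V_i\) and the dynamics \(f\), whereas the advantage- and Q-function forms take the maximization directly over those functions instead, allowing completely model-free operation.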