IEEE Annual Conference on Decision and Control

Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning

Abstract

We focus on policy search in reinforcement learning problems over continuous spaces, where the value is defined by infinite-horizon discounted reward accumulation. This is the canonical setting proposed by Bellman [3]. Policy search, specifically the policy gradient (PG) method, scales gracefully to problems with continuous spaces and allows for deep network parametrizations; however, experimentally it is known to be volatile, and its finite-time behavior is not well understood. A major source of this gap is that unbiased ascent directions are elusive, and hence only asymptotic convergence to stationarity can be shown via links to ordinary differential equations [4]. In this work, we propose a new variant of PG methods that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient, which we establish yields an unbiased policy search direction. Furthermore, we conduct global convergence analysis from a nonconvex optimization perspective: (i) we first recover the results of asymptotic convergence to stationary-point policies in the literature through an alternative supermartingale argument; (ii) we provide the iteration complexity, i.e., convergence rate, of policy gradient in the infinite-horizon setting, showing that it exhibits rates comparable to the stochastic gradient method in the nonconvex regime under diminishing and constant stepsize rules. Numerical experiments on the inverted pendulum demonstrate the validity of our results.
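For concreteness, below is a minimal sketch of the idea described in the abstract: if the rollout horizon is drawn from a geometric distribution with parameter 1-γ, then an undiscounted, randomly truncated reward sum is an unbiased estimate of the discounted Q-value, and the resulting score-function direction is an unbiased estimate of the infinite-horizon policy gradient. The toy environment, the linear Gaussian policy, and the stepsizes are illustrative assumptions, not the paper's estimator or experimental setup.

```python
# Sketch (not the authors' code): policy gradient with a random geometric rollout
# horizon, giving an unbiased Monte-Carlo estimate of the discounted policy gradient.
# The environment, policy class, and hyperparameters are illustrative assumptions.
import numpy as np

class ToyEnv:
    """1-D toy system: x' = x + a + noise, reward = -(x^2 + 0.1*a^2)."""
    def reset(self):
        self.x = np.random.uniform(-1.0, 1.0)
        return self.x
    def step(self, a):
        self.x = self.x + a + 0.05 * np.random.randn()
        return self.x, -(self.x ** 2 + 0.1 * a ** 2)

def gaussian_policy_sample(theta, x, sigma=0.3):
    """Linear Gaussian policy a ~ N(theta*x, sigma^2); returns action and d/dtheta log pi."""
    mean = theta * x
    a = mean + sigma * np.random.randn()
    grad_log_pi = (a - mean) * x / sigma ** 2
    return a, grad_log_pi

def pg_estimate(env, theta, gamma=0.95):
    """One unbiased stochastic ascent direction using two geometric random horizons."""
    # Reach a state distributed according to the discounted occupancy measure:
    # P(T1 = k) = (1-gamma) * gamma^k for k = 0, 1, 2, ...
    T1 = np.random.geometric(1.0 - gamma) - 1
    x = env.reset()
    for _ in range(T1):
        a, _ = gaussian_policy_sample(theta, x)
        x, _ = env.step(a)
    # Score function at the sampled state-action pair.
    a, grad_log_pi = gaussian_policy_sample(theta, x)
    x, r = env.step(a)
    # Unbiased Q estimate: undiscounted reward sum over a Geom(1-gamma) horizon,
    # since E[sum_{t<=T2} r_t] = sum_t gamma^t E[r_t].
    T2 = np.random.geometric(1.0 - gamma) - 1
    q_hat = r
    for _ in range(T2):
        a2, _ = gaussian_policy_sample(theta, x)
        x, r = env.step(a2)
        q_hat += r
    # The 1/(1-gamma) factor normalizes the discounted occupancy measure.
    return (1.0 / (1.0 - gamma)) * q_hat * grad_log_pi

# Stochastic gradient ascent with a diminishing stepsize (one of the rules analyzed).
theta, gamma = 0.0, 0.95
for k in range(1, 2001):
    theta += (0.01 / np.sqrt(k)) * pg_estimate(ToyEnv(), theta, gamma)
print("learned feedback gain:", theta)
```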
