IEEE Annual Conference on Decision and Control

Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning

Abstract

We focus on policy search in reinforcement learning problems over continuous spaces, where the value is defined by infinite-horizon discounted reward accumulation. This is the canonical setting proposed by Bellman [3]. Policy search, specifically the policy gradient (PG) method, scales gracefully to problems with continuous spaces and allows for deep network parametrizations; however, experimentally it is known to be volatile, and its finite-time behavior is not well understood. A major source of this gap is that unbiased ascent directions are elusive, and hence only asymptotic convergence to stationarity can be shown via links to ordinary differential equations [4]. In this work, we propose a new variant of PG methods that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient, which we establish yields an unbiased policy search direction. Furthermore, we conduct global convergence analysis from a nonconvex optimization perspective: (i) we first recover the results of asymptotic convergence to stationary-point policies in the literature through an alternative supermartingale argument; (ii) we provide the iteration complexity, i.e., convergence rate, of policy gradient in the infinite-horizon setting, showing that it exhibits rates comparable to the stochastic gradient method in the nonconvex regime under diminishing and constant stepsize rules. Numerical experiments on the inverted pendulum demonstrate the validity of our results.
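For concreteness, below is a minimal sketch of the idea described in the abstract: if the rollout horizon is drawn from a geometric distribution with parameter 1-γ, then an undiscounted, randomly truncated reward sum is an unbiased estimate of the discounted Q-value, and the resulting score-function direction is an unbiased estimate of the infinite-horizon policy gradient. The toy environment, the linear Gaussian policy, and the stepsizes are illustrative assumptions, not the paper's estimator or experimental setup.

```python
# Sketch (not the authors' code): policy gradient with a random geometric rollout
# horizon, giving an unbiased Monte-Carlo estimate of the discounted policy gradient.
# The environment, policy class, and hyperparameters are illustrative assumptions.
import numpy as np

class ToyEnv:
    """1-D toy system: x' = x + a + noise, reward = -(x^2 + 0.1*a^2)."""
    def reset(self):
        self.x = np.random.uniform(-1.0, 1.0)
        return self.x
    def step(self, a):
        self.x = self.x + a + 0.05 * np.random.randn()
        return self.x, -(self.x ** 2 + 0.1 * a ** 2)

def gaussian_policy_sample(theta, x, sigma=0.3):
    """Linear Gaussian policy a ~ N(theta*x, sigma^2); returns action and d/dtheta log pi."""
    mean = theta * x
    a = mean + sigma * np.random.randn()
    grad_log_pi = (a - mean) * x / sigma ** 2
    return a, grad_log_pi

def pg_estimate(env, theta, gamma=0.95):
    """One unbiased stochastic ascent direction using two geometric random horizons."""
    # Reach a state distributed according to the discounted occupancy measure:
    # P(T1 = k) = (1-gamma) * gamma^k for k = 0, 1, 2, ...
    T1 = np.random.geometric(1.0 - gamma) - 1
    x = env.reset()
    for _ in range(T1):
        a, _ = gaussian_policy_sample(theta, x)
        x, _ = env.step(a)
    # Score function at the sampled state-action pair.
    a, grad_log_pi = gaussian_policy_sample(theta, x)
    x, r = env.step(a)
    # Unbiased Q estimate: undiscounted reward sum over a Geom(1-gamma) horizon,
    # since E[sum_{t<=T2} r_t] = sum_t gamma^t E[r_t].
    T2 = np.random.geometric(1.0 - gamma) - 1
    q_hat = r
    for _ in range(T2):
        a2, _ = gaussian_policy_sample(theta, x)
        x, r = env.step(a2)
        q_hat += r
    # The 1/(1-gamma) factor normalizes the discounted occupancy measure.
    return (1.0 / (1.0 - gamma)) * q_hat * grad_log_pi

# Stochastic gradient ascent with a diminishing stepsize (one of the rules analyzed).
theta, gamma = 0.0, 0.95
for k in range(1, 2001):
    theta += (0.01 / np.sqrt(k)) * pg_estimate(ToyEnv(), theta, gamma)
print("learned feedback gain:", theta)
```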
