Journal of Machine Learning Research

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Abstract

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case --- which avoid explicit worst-case dependencies on the size of state space --- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
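
Below is a minimal sketch of the tabular softmax setting the abstract refers to: exact policy gradient ascent on a small synthetic MDP, using the standard softmax gradient dV^pi(rho)/dtheta(s,a) = (1/(1-gamma)) * d_rho^pi(s) * pi(a|s) * A^pi(s,a). The toy MDP (P, r, rho), the step size, and the iteration count are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP; all quantities below are illustrative assumptions.
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition kernel
r = rng.uniform(size=(S, A))                 # r[s, a] rewards in [0, 1]
rho = np.full(S, 1.0 / S)                    # start-state distribution

theta = np.zeros((S, A))                     # tabular softmax parameters

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

for t in range(2000):
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)    # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # V^pi, exact
    Q = r + gamma * P @ V                                 # Q^pi(s, a)
    # Discounted state-visitation distribution d_rho^pi.
    d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
    # Exact softmax policy gradient: (1/(1-gamma)) d(s) pi(a|s) A^pi(s, a).
    adv = Q - V[:, None]
    grad = d[:, None] * pi * adv / (1 - gamma)
    theta += 0.5 * grad                       # plain gradient ascent step

print("V^pi(rho) after training:", rho @ V)

For reference, the "precisely defined condition number" in the final sentence is a distribution mismatch coefficient, roughly of the form ||d_rho^{pi*}/mu||_inf, measuring how well the measure mu used for optimization covers the states visited by an optimal policy.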