Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Andras Antos; Csaba Szepesvari; Remi Munos


Abstract

In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.
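
The abstract notes that, under a linear parameterization, the studied algorithm coincides with Least-Squares Policy Iteration (LSPI). The following is a minimal, generic LSPI sketch for that linear case, not the authors' code: the five-state chain MDP, the one-hot features, the small ridge term, and every function name are assumptions made purely for illustration. It shows the setting the abstract describes, where a single trajectory of a fixed (here uniformly random) behavior policy is the only input and the learned policy differs from the one that generated the data.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # hypothetical toy chain MDP


def step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward


def collect_single_path(n_steps, rng):
    """One trajectory generated by a fixed, uniformly random behavior policy."""
    path, s = [], 0
    for _ in range(n_steps):
        a = int(rng.integers(N_ACTIONS))
        s_next, r = step(s, a)
        path.append((s, a, r, s_next))
        s = s_next
    return path


def phi(s, a):
    """One-hot (tabular) features, i.e. a linear parameterization of Q."""
    f = np.zeros(N_STATES * N_ACTIONS)
    f[s * N_ACTIONS + a] = 1.0
    return f


def lstdq(path, policy, ridge=1e-6):
    """Evaluate `policy` from off-policy data by solving A w = b (LSTD-Q).

    The ridge term is an assumption added only to keep A invertible.
    """
    d = N_STATES * N_ACTIONS
    A, b = ridge * np.eye(d), np.zeros(d)
    for s, a, r, s_next in path:
        f = phi(s, a)
        f_next = phi(s_next, policy[s_next])  # next action chosen by `policy`
        A += np.outer(f, f - GAMMA * f_next)
        b += r * f
    return np.linalg.solve(A, b)


def lspi(path, max_iter=20):
    """Alternate LSTD-Q evaluation and greedy improvement until stable."""
    policy = np.zeros(N_STATES, dtype=int)  # start from "always left"
    for _ in range(max_iter):
        q = lstdq(path, policy).reshape(N_STATES, N_ACTIONS)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy


path = collect_single_path(5000, np.random.default_rng(0))
learned_policy = lspi(path)
print(learned_policy)
```

On this toy chain the greedy policy should settle on moving right in every state, since the only reward is earned by reaching the right end. The sketch says nothing about the finite-sample bound or the VC-crossing dimension; those concern general function sets and continuous state spaces, which a tabular example cannot exercise.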

Bibliographic details

Source: Machine Learning, 2008, Issue 1, pp. 89-129 (41 pages)
Authors: Andras Antos; Csaba Szepesvari; Remi Munos
Author affiliation: Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary
Indexing information
Original format: PDF
Text language: English
Chinese Library Classification: computing technology, computer technology
Keywords: reinforcement learning; policy iteration; Bellman-residual minimization; least-squares temporal difference learning; off-policy learning; nonparametric regression; least-squares regression; finite-sample bounds
