Machine Learning (journal)

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Abstract

In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.
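
To make the algorithmic idea concrete, the following Python snippet is a minimal, hypothetical sketch of fitted policy iteration in which each policy-evaluation step fits a linear action-value function by least-squares minimization of the empirical Bellman residual along a single trajectory, followed by greedy policy improvement. The trajectory format, the feature map phi, and all function names are assumptions made for this example; the plain squared-residual criterion used here is a simplification of the risk function studied in the paper.

```python
import numpy as np


def bellman_residual_policy_iteration(traj, phi, actions, gamma=0.95, n_iters=10):
    """Hypothetical sketch: fitted policy iteration on a single sample path.

    traj    -- list of transitions (s, a, r, s_next) from one fixed-policy trajectory
    phi     -- feature map phi(s, a) -> np.ndarray of length d (linear parameterization)
    actions -- finite action set to maximize over greedily
    """
    d = phi(traj[0][0], actions[0]).shape[0]
    w = np.zeros(d)  # weights of the linear estimate Q_w(s, a) = w . phi(s, a)

    def greedy_action(s, weights):
        # Policy improvement: act greedily with respect to the current Q estimate.
        return max(actions, key=lambda a: float(phi(s, a) @ weights))

    for _ in range(n_iters):
        # Policy evaluation via empirical Bellman-residual minimization:
        #   min_w  sum_t ( w.phi(s_t, a_t) - r_t - gamma * w.phi(s_{t+1}, pi(s_{t+1})) )^2,
        # where pi is greedy w.r.t. the previous iterate.  With a linear Q this
        # reduces to an ordinary least-squares problem in w.
        rows, targets = [], []
        for (s, a, r, s_next) in traj:
            a_next = greedy_action(s_next, w)
            rows.append(phi(s, a) - gamma * phi(s_next, a_next))
            targets.append(r)
        A = np.asarray(rows)
        b = np.asarray(targets, dtype=float)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)

    return w, greedy_action
```

Note that minimizing the plain squared Bellman residual estimated from single transitions is, in general, a biased criterion under stochastic dynamics; the risk function analyzed in the paper is constructed to handle this, and the abstract's equivalence to Least-Squares Policy Iteration under a linear parameterization refers to that criterion, not necessarily to this simplified sketch.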
