
Safe Policy Improvement with Soft Baseline Bootstrapping



Abstract

Batch Reinforcement Learning (Batch RL) consists of training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides high-probability guarantees that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance compared to the basic RL objective, which boils down to solving the maximum-likelihood MDP (Laroche et al. 2019). Here, we build on that work and improve the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of classifying the state-action pairs into two sets (the uncertain ones and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions while remaining provably safe, and is therefore less conservative than state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms, both on finite MDPs and on infinite MDPs with neural network function approximation.
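
The abstract does not state the constraint explicitly. As a minimal sketch, assume it takes the form of an uncertainty-weighted L1 distance between the candidate policy and the baseline in each state, bounded by a budget. The names `err`, `eps`, `constraint_cost`, and `max_shift` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np


def constraint_cost(pi, pi_b, err):
    """Uncertainty-weighted distance between candidate and baseline in one state.

    pi, pi_b: action probabilities of the candidate and baseline policies.
    err: assumed per-action uncertainty (larger for rarely observed actions).
    """
    return float(np.sum(err * np.abs(pi - pi_b)))


def max_shift(pi_b, err, src, dst, eps):
    """Largest probability mass that can move from action src to action dst.

    Moving mass m costs (err[src] + err[dst]) * m under the assumed constraint,
    and m cannot exceed the mass the baseline puts on src.
    """
    return min(pi_b[src], eps / (err[src] + err[dst]))


# One state, two actions: action 1 is poorly observed (high uncertainty).
pi_b = np.array([0.5, 0.5])
err = np.array([0.1, 0.4])
eps = 0.1

m = max_shift(pi_b, err, src=0, dst=1, eps=eps)
pi = pi_b.copy()
pi[0] -= m
pi[1] += m
print(m, pi, constraint_cost(pi, pi_b, err))
```

Under this assumed form, a well-observed pair of actions (small `err`) allows a large shift away from the baseline, while an uncertain pair allows only a small one. This is the sense in which the policy change is graded by uncertainty rather than switched on or off per state-action pair.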
