Safe Policy Improvement with Soft Baseline Bootstrapping


Abstract

Batch Reinforcement Learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance compared with the basic RL objective, which boils down to solving the MDP estimated with maximum likelihood (Laroche et al. 2019). Here, we build on that work and improve, more precisely, the SPI with Baseline Bootstrapping (SPIBB) algorithm by allowing the policy search over a wider set of policies. Instead of classifying the state-action pairs into two sets (the uncertain ones and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions while remaining provably safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms both on finite MDPs and on infinite MDPs with neural network function approximation.
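
The contrast between the binary classification and the softer strategy can be illustrated for a single state. The sketch below is not the authors' implementation: the function names, the 1/sqrt(N) error proxy, the count threshold n_wedge, and the budget epsilon are all assumptions made for illustration. It assumes the soft constraint takes the form of an uncertainty-weighted distance between the candidate policy and the baseline that must stay under a budget.

```python
import math

def binary_split(counts, n_wedge):
    """Hard split: an action is 'uncertain' if it has fewer than n_wedge samples.
    (n_wedge is a hypothetical threshold name used for this sketch.)"""
    return [n < n_wedge for n in counts]

def local_error(counts):
    """Illustrative uncertainty proxy (an assumption): shrinks as 1/sqrt(N)."""
    return [1.0 / math.sqrt(n) if n > 0 else math.inf for n in counts]

def constraint_cost(pi, pi_b, err):
    """Uncertainty-weighted distance between candidate pi and baseline pi_b
    in one state: sum over actions of err[a] * |pi[a] - pi_b[a]|."""
    total = 0.0
    for p, pb, e in zip(pi, pi_b, err):
        diff = abs(p - pb)
        if diff > 0.0:  # skip unchanged actions so inf * 0 never occurs
            total += e * diff
    return total

def is_soft_safe(pi, pi_b, err, epsilon):
    """True if the candidate stays within the assumed uncertainty budget."""
    return constraint_cost(pi, pi_b, err) <= epsilon

# One state with three actions and their sample counts in the batch.
counts = [100, 4, 25]
pi_b = [0.5, 0.3, 0.2]
err = local_error(counts)          # [0.1, 0.5, 0.2]

uncertain = binary_split(counts, n_wedge=10)
small_shift = [0.6, 0.2, 0.2]      # moves a little mass off the rare action
large_shift = [0.5, 0.0, 0.5]      # moves all mass off the rare action
```

Under the hard split, the second action is flagged as uncertain and its baseline probability would simply be copied. Under the soft constraint as assumed here, small_shift costs about 0.06 and fits a budget of 0.1, while large_shift costs about 0.21 and does not, so deviations on rarely observed actions are allowed but priced by their uncertainty.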

Bibliographic details

Source: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019, pp. 53-68 (16 pages)
Authors: Kimia Nadjahi; Romain Laroche; Remi Tachet des Combes
Original format: PDF

Similar literature

1. Patient safety culture in Peking University Cancer Hospital in China: baseline assessment and comparative analysis for quality improvement [J]. Xiyao Zhong, Yuqin Song, Christine Dennis. BMC Health Services Research. 2019, no. 1
2. Can a continuous quality improvement program create culturally safe emergency departments for Aboriginal people in Australia? A multiple baseline study [J]. Thomas Gadsden, Gai Wilson, James Totterdell. BMC Health Services Research. 2019, no. 1
3. A cross-sectional study to assess the patient safety culture in the Palestinian hospitals: a baseline assessment for quality improvement [J]. Aymen Elsous, Ali Akbari Sari, Arash Rashidian. Journal of the Royal Society of Medicine. 2016, no. 12
4. Safe Policy Improvement with Baseline Bootstrapping [C]. Romain Laroche, Paul Trichelair, Remi Tachet des Combes. International Conference on Machine Learning. 2019
5. Safety kernel enforcement of software safety policies [D]. Wika, Kevin G. 1995
6. Can a continuous quality improvement program create culturally safe emergency departments for Aboriginal people in Australia? A multiple baseline study [O]. Thomas Gadsden, Gai Wilson, James Totterdell. 2019
7. Nothing soft about 'soft skills': core competencies in quality improvement and patient safety education and practice [O]. Joanne Goldman, Brian M Wong. 2020
