JMLR: Workshop and Conference Proceedings

Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement


Abstract

The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two ideas, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we extend the SF&GPI framework in two ways. One of the basic assumptions underlying the original formulation of SF&GPI is that rewards for all tasks of interest can be computed as linear combinations of a fixed set of features. We relax this constraint and show that the theoretical guarantees supporting the framework can be extended to any set of tasks that only differ in the reward function. Our second contribution is to show that one can use the reward functions themselves as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand. This makes it possible to combine SF&GPI with deep learning in a more stable way. We empirically verify this claim on a complex 3D environment where observations are images from a first-person perspective. We show that the transfer promoted by SF&GPI leads to very good policies on unseen tasks almost instantaneously. We also describe how to learn policies specialised to the new tasks in a way that allows them to be added to the agent’s set of skills, and thus be reused in the future.
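The SF&GPI machinery the abstract builds on can be summarised in two lines: if task rewards decompose as r(s, a, s') = phi(s, a, s') . w, then each stored policy pi_i is summarised by its successor features psi^{pi_i}(s, a) = E[ sum_t gamma^t phi_{t+1} ], its value on a new task is Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w, and GPI acts greedily with respect to the best of these values. Below is a minimal NumPy sketch of that action-selection step; it is not code from the paper, and the array shapes and function name are illustrative assumptions.

    import numpy as np

    def gpi_action(sf_at_state, w):
        """GPI action selection over a library of policies summarised by successor features.

        sf_at_state: array of shape (n_policies, n_actions, n_features); entry [i, a]
            is psi^{pi_i}(s, a), the expected discounted sum of features phi under
            stored policy pi_i, evaluated at the current state s.
        w: array of shape (n_features,); reward weights of the new task, under the
            assumption r(s, a, s') = phi(s, a, s') . w.
        """
        # Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w for every stored policy and action
        q_values = sf_at_state @ w              # shape (n_policies, n_actions)
        # GPI: act greedily with respect to the best stored policy at this state
        return int(q_values.max(axis=0).argmax())

    # Illustrative call: 3 stored policies, 4 actions, 5 reward features
    rng = np.random.default_rng(0)
    action = gpi_action(rng.normal(size=(3, 4, 5)), rng.normal(size=5))

The paper's first contribution relaxes the linear-reward assumption encoded in w above, and its second uses learned reward functions themselves in place of a hand-specified phi.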
