IEEE Transactions on Neural Networks and Learning Systems

Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning



Abstract

Learning from demonstrations is a paradigm by which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. It is usually implemented as either imitation learning (IL) or inverse reinforcement learning (IRL) in the literature. On the one hand, IRL is a paradigm relying on Markov decision processes, where the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior. On the other hand, IL consists in directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (it is therefore close to classification when there is a finite set of possible decisions). While these two views are often considered as opposed to each other, the purpose of this paper is to exhibit a formal link between them from which new algorithms can be derived. We show that IL and IRL can be redefined in such a way that they are equivalent, in the sense that there exists an explicit bijective operator (namely, the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework, which creates a clear link between IL and IRL. As a result, IL and IRL solutions making the best of both worlds are obtained. In addition, this unifying framework allows existing IL and IRL algorithms to be derived and opens the way to IL methods able to deal with the environment's dynamics. Finally, the IRL algorithms derived from the set-policy framework are compared with algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both the standard IRL and IL ones and yield more robust solutions.
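The "inverse optimal Bellman operator" mentioned in the abstract can be made concrete with a short derivation. The sketch below uses generic MDP notation and an illustrative operator symbol of our own choosing, not necessarily the paper's; it only recalls why inverting the optimal Bellman equation yields an explicit map from action-value functions to rewards, which is the kind of correspondence between the IL and IRL solution spaces the abstract refers to.

```latex
% Sketch in generic MDP notation (illustrative; the paper's own notation may differ).
% For a reward R and discount factor \gamma, the optimal Bellman equation reads
\[
  Q^{*}(s,a) \;=\; R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a').
\]
% Read in the other direction, any bounded Q can be mapped to the reward
\[
  (\mathcal{J} Q)(s,a) \;=\; Q(s,a) \;-\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a'),
\]
% for which Q satisfies the optimality equation by construction; by uniqueness of
% the Bellman fixed point (for \gamma < 1), Q is the optimal action-value function
% of the reward \mathcal{J} Q. Pairing value functions whose greedy policies match
% the expert's decisions (the IL side) with the rewards they induce through
% \mathcal{J} (the IRL side) gives an explicit, invertible correspondence of the
% kind described in the abstract.
```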


