Expert Systems with Applications

Combination of learning from non-optimal demonstrations and feedbacks using inverse reinforcement learning and Bayesian policy improvement


Abstract

Inverse reinforcement learning (IRL) is a powerful tool for teaching by demonstration, provided that sufficiently diverse and optimal demonstrations are given and the learner agent perceives those demonstrations correctly. These conditions are hard to meet in practice: a trainer cannot cover all possibilities through demonstrations and may partially fail to follow the optimal behavior. Moreover, the trainer and the learner have different perceptions of the environment, including of the trainer's actions. A practical way to overcome these problems is to combine the trainer's demonstrations with feedbacks. We propose an interactive learning approach that overcomes the challenge of non-optimal demonstrations by integrating human evaluative feedbacks into the IRL process, given sufficiently diverse demonstrations and the domain transition model. To this end, we develop a probabilistic model of human feedbacks and iteratively improve the agent's policy using Bayes' rule. We then integrate this information into an extended IRL algorithm to enhance the learned reward function. We examine the developed approach on one experimental and two simulated tasks: grid-world navigation, a highway car-driving system, and a navigation task with the e-puck robot. The obtained results show significantly improved efficiency of the proposed approach in the face of different levels of non-optimality in the demonstrations and varying numbers of evaluative feedbacks.
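The core technical step described above is a Bayes-rule update of the agent's policy from noisy human evaluative feedback. The sketch below illustrates one plausible reading of that step, not the authors' actual algorithm: it assumes a single feedback-consistency parameter (the probability that the trainer's feedback is correct), a toy grid world, and an illustrative feedback log; all names and values are assumptions for exposition.

```python
import numpy as np

def update_belief(belief, feedback, consistency=0.9):
    """Bayes-rule update of P(action is optimal in this state)
    given a +1 (approve) or -1 (disapprove) feedback signal."""
    if feedback > 0:   # trainer approved the action
        like_opt, like_sub = consistency, 1.0 - consistency
    else:              # trainer disapproved the action
        like_opt, like_sub = 1.0 - consistency, consistency
    post = like_opt * belief
    return post / (post + like_sub * (1.0 - belief))

# Toy grid world: uniform prior over which action is optimal in each state.
n_states, n_actions = 4, 3
belief = np.full((n_states, n_actions), 1.0 / n_actions)

# Hypothetical feedback log: (state, action, feedback).
feedback_log = [(0, 1, +1), (0, 2, -1), (3, 0, +1)]
for s, a, f in feedback_log:
    belief[s, a] = update_belief(belief[s, a], f)

# Greedy policy with respect to the posterior belief.
policy = belief.argmax(axis=1)
print(policy)
```

In the paper's setting, a belief of this kind would then feed back into the extended IRL step to refine the learned reward function rather than being used directly as the final policy.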
