Joint IEEE International Conference on Development and Learning and Epigenetic Robotics

Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm

Abstract

An important issue in reinforcement learning systems for autonomous agents is whether it makes sense to have separate systems for predicting rewards and punishments. In robotics, learning and control are typically achieved by a single controller, with punishments coded as negative rewards. However, in biological systems, some evidence suggests that the brain has a separate system for punishment. Although this may in part be due to biological constraints on implementing negative quantities, it raises the question of whether there is any computational rationale for keeping reward and punishment prediction operationally distinct. Here we outline a basic argument supporting this idea, based on the proposition that learning best-case predictions (as in Q-learning) does not always achieve the safest behaviour. We introduce a modified RL scheme involving a new algorithm, which we call 'MaxPain', that backs up worst-case predictions in parallel and then scales the two predictions in a multi-attribute RL policy, i.e. independently learning 'what to do' as well as 'what not to do' and then combining this information. We show how this scheme can improve performance in benchmark RL environments, including a grid-world experiment and a delayed version of the mountain car experiment. In particular, we demonstrate how early exploration and learning are substantially improved, leading to much 'safer' behaviour. In conclusion, the results illustrate the importance of independent punishment prediction in RL, and provide a testable framework for better understanding punishment (such as pain) and avoidance in humans, in both health and disease.
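
To make the scheme concrete, here is a minimal tabular sketch of the parallel backup described above. This is an illustration under stated assumptions, not the paper's implementation: the table names Q_r and Q_p, the linear weight w, and the softmax combination are our own choices, and the paper's exact scaling rule for the multi-attribute policy may differ.

import numpy as np

# Two value tables learned in parallel, per the abstract's scheme:
# Q_r backs up best-case reward predictions (standard Q-learning),
# Q_p backs up worst-case (maximum) pain predictions, so pain is not
# merely coded as negative reward.
n_states, n_actions = 25, 4            # e.g. a 5x5 grid-world (assumed)
alpha, gamma, w = 0.1, 0.95, 0.7       # learning rate, discount, reward weight (assumed)

Q_r = np.zeros((n_states, n_actions))  # 'what to do'
Q_p = np.zeros((n_states, n_actions))  # 'what not to do'

def update(s, a, reward, pain, s_next):
    """One parallel backup from a transition carrying separate reward and pain signals."""
    # Best-case backup for reward: standard Q-learning target.
    Q_r[s, a] += alpha * (reward + gamma * Q_r[s_next].max() - Q_r[s, a])
    # Worst-case backup for pain: propagate the *maximum* expected future pain.
    Q_p[s, a] += alpha * (pain + gamma * Q_p[s_next].max() - Q_p[s, a])

def policy(s, temperature=1.0):
    """Combine the two predictions: seek reward while independently avoiding pain."""
    pref = w * Q_r[s] - (1.0 - w) * Q_p[s]          # multi-attribute score (assumed linear)
    z = np.exp((pref - pref.max()) / temperature)   # numerically stable softmax
    return z / z.sum()                              # action probabilities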