
A novel multi-step reinforcement learning method for solving reward hacking


Abstract

Reinforcement learning with an appropriately designed reward signal can solve many sequential learning problems. In practice, however, reinforcement learning algorithms can fail in unexpected, counterintuitive ways. One such failure mode is reward hacking, which typically occurs when a reward function allows the agent to obtain a high return in an unintended way; such behavior can subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to address reward hacking. Unlike traditional algorithms, the proposed method uses a new return function that alters the discounting of future rewards and no longer treats the immediate reward as the dominant influence when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method can alleviate the negative impact of reward hacking and substantially improve the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method can also be applied successfully to continuous state-space problems.
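The abstract does not give the paper's return function explicitly. As a rough illustration of the idea, the sketch below computes an n-step bootstrapped Q-learning target in which the discount schedule is flattened so that the immediate reward no longer dominates the target. The function name `multi_step_target` and the `gamma**(k/2)` schedule are illustrative assumptions, not the paper's actual formula.

```python
import numpy as np

def multi_step_target(rewards, q_bootstrap, gamma=0.99):
    """Illustrative n-step target with a flattened discount schedule.

    rewards     : list of r_{t+1} ... r_{t+n} collected along the trajectory
    q_bootstrap : max_a Q(s_{t+n}, a), the usual n-step bootstrap value

    NOTE: the paper's exact return function is not stated in the abstract;
    the gamma**(k/2) weighting below is an assumption chosen only to show
    how a slower-decaying schedule keeps more weight on later rewards.
    """
    n = len(rewards)
    # Plain n-step discounting would use gamma**k; gamma**(k/2) decays more
    # slowly, so the immediate reward no longer dominates the target.
    weights = gamma ** (np.arange(n) / 2.0)
    return float(weights @ np.asarray(rewards, dtype=float)) + gamma**n * q_bootstrap

# Usage in a tabular n-step Q-learning update (sketch):
# Q[s_t, a_t] += alpha * (multi_step_target(rewards, Q[s_n].max()) - Q[s_t, a_t])
```

Under plain geometric discounting the first reward carries the largest weight in the target, so an agent can profit from greedily exploiting a locally mis-specified reward; a slower-decaying schedule spreads that weight across the lookahead window, which is one way to make such hacks less attractive.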