
A novel multi-step reinforcement learning method for solving reward hacking


Abstract

Reinforcement learning with an appropriately designed reward signal can solve many sequential learning problems. In practice, however, reinforcement learning algorithms can fail in unexpected, counterintuitive ways. One such failure mode is reward hacking, which typically occurs when a reward function allows the agent to obtain a high return in an unintended way; such behavior can subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to address reward hacking. Unlike traditional algorithms, the proposed method uses a new return function that alters the discounting of future rewards and no longer treats the immediate reward as the dominant influence when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method can alleviate the negative impact of reward hacking and substantially improve the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method can also be applied successfully to continuous state-space problems.
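The abstract does not give the paper's return function explicitly. As a rough illustration of the idea, the sketch below computes an n-step bootstrapped Q-learning target in which the discount schedule is flattened so that the immediate reward no longer dominates the target. The function name `multi_step_target` and the `gamma**(k/2)` schedule are illustrative assumptions, not the paper's actual formula.

```python
import numpy as np

def multi_step_target(rewards, q_bootstrap, gamma=0.99):
    """Illustrative n-step target with a flattened discount schedule.

    rewards     : list of r_{t+1} ... r_{t+n} collected along the trajectory
    q_bootstrap : max_a Q(s_{t+n}, a), the usual n-step bootstrap value

    NOTE: the paper's exact return function is not stated in the abstract;
    the gamma**(k/2) weighting below is an assumption chosen only to show
    how a slower-decaying schedule keeps more weight on later rewards.
    """
    n = len(rewards)
    # Plain n-step discounting would use gamma**k; gamma**(k/2) decays more
    # slowly, so the immediate reward no longer dominates the target.
    weights = gamma ** (np.arange(n) / 2.0)
    return float(weights @ np.asarray(rewards, dtype=float)) + gamma**n * q_bootstrap

# Usage in a tabular n-step Q-learning update (sketch):
# Q[s_t, a_t] += alpha * (multi_step_target(rewards, Q[s_n].max()) - Q[s_t, a_t])
```

Under plain geometric discounting the first reward carries the largest weight in the target, so an agent can profit from greedily exploiting a locally mis-specified reward; a slower-decaying schedule spreads that weight across the lookahead window, which is one way to make such hacks less attractive.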