
Theory and Application of Reward Shaping in Reinforcement Learning


Abstract

Applying conventional reinforcement learning to complex domains requires the use of an overly simplified task model, or a large amount of training experience. This problem results from the need to experience everything about an environment before gaining confidence in a course of action. But for most interesting problems, the domain is far too large to be exhaustively explored. We address this disparity with reward shaping - a technique that provides localized feedback based on prior knowledge to guide the learning process. By using localized advice, learning is focused into the most relevant areas, which allows for efficient optimization, even in complex domains. We propose a complete theory for the process of reward shaping that demonstrates how it accelerates learning, what the ideal shaping rewards are like, and how to express prior knowledge in order to enhance the learning process. Central to our analysis is the idea of the reward horizon, which characterizes the delay between an action and accurate estimation of its value. In order to maintain focused learning, the goal of reward shaping is to promote a low reward horizon. One type of reward that always generates a low reward horizon is opportunity value. Opportunity value is the value for choosing one action rather than doing nothing. This information, when combined with the native rewards, is enough to decide the best action immediately. Using opportunity value as a model, we suggest subgoal shaping and dynamic shaping as techniques to communicate whatever prior knowledge is available. We demonstrate our theory with two applications: a stochastic gridworld, and a bipedal walking control task. In all cases, the experiments uphold the analytical predictions; most notably that reducing the reward horizon implies faster learning. The bipedal walking task demonstrates that our reward shaping techniques allow a conventional reinforcement learning algorithm to find a good behavior efficiently despite a large state space with stochastic actions.
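The abstract describes reward shaping as adding localized, prior-knowledge-based feedback to the native reward so that learning needs less exploration. As a rough illustration of that general idea only, the sketch below adds a simple potential-based shaping term to tabular Q-learning on a toy gridworld; it is not the opportunity-value, subgoal, or dynamic shaping construction from the thesis, and all names and parameters (SIZE, GOAL, potential, ALPHA, etc.) are illustrative assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch: tabular Q-learning on a small deterministic gridworld,
# with an optional shaping bonus derived from a hand-chosen state potential.
# The potential (negative distance to the goal) stands in for "prior
# knowledge"; the thesis's own opportunity-value formulation is not shown.

SIZE = 5                                     # 5x5 grid, start (0, 0)
GOAL = (SIZE - 1, SIZE - 1)                  # native reward only at the goal
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)] # four axis-aligned moves
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def step(state, action):
    """Apply an action, clipping to the grid; native reward is +1 at the goal."""
    x, y = state
    dx, dy = action
    nxt = (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))
    native_reward = 1.0 if nxt == GOAL else 0.0
    return nxt, native_reward, nxt == GOAL

def potential(state):
    """Prior knowledge encoded as a potential: negative Manhattan distance to goal."""
    return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))

def shaping_reward(state, next_state):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return GAMMA * potential(next_state) - potential(state)

def train(episodes=200, use_shaping=True):
    Q = defaultdict(float)                   # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            if random.random() < EPSILON:    # epsilon-greedy exploration
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            if use_shaping:                  # add the localized advice signal
                reward += shaping_reward(state, next_state)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            target = reward + (0.0 if done else GAMMA * best_next)
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            state = next_state
    return Q

if __name__ == "__main__":
    random.seed(0)
    Q = train(use_shaping=True)
    print("Q at start for action (0, 1):", round(Q[((0, 0), (0, 1))], 3))
```

In this toy setup the shaping term gives immediate feedback about progress toward the goal at every step, which is the effect the abstract attributes to a low reward horizon: the value of an action can be estimated without waiting for the distant native reward.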

Bibliographic record

  • Author

    Laud, Adam Daniel

  • Author affiliation
  • Year: 2004
  • Total pages
  • Original format: PDF
  • Language: English
  • CLC classification
