International Conference on Machine Learning

Dead-ends and Secure Exploration in Reinforcement Learning

Abstract

Many interesting applications of reinforcement learning (RL) involve MDPs that include numerous "dead-end" states. Upon reaching a dead-end state, the agent continues to interact with the environment on a dead-end trajectory until it reaches an undesired terminal state, regardless of what actions are chosen. The situation is even worse when the existence of many dead-end states is coupled with positive rewards that are distant from any initial state (we term this the Bridge Effect). Hence, conventional exploration techniques often incur prohibitively many training steps before convergence. To deal with the bridge effect, we propose a condition for exploration, called security. We next establish formal results that translate the security condition into the learning problem of an auxiliary value function. This new value function is used to cap "any" given exploration policy and is guaranteed to make it secure. As a special case, we use this theory to introduce the secure random-walk. We next extend our results to the deep RL setting by identifying and addressing two main challenges that arise. Finally, we empirically compare the secure random-walk with standard benchmarks in two sets of experiments, including the Atari game Montezuma's Revenge.
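Based only on the description above, the following is a minimal sketch of how an auxiliary value function might be used to cap an exploration policy at a single state. The names `cap_exploration_policy` and `q_d_s`, the assumption that the auxiliary dead-end values lie in [-1, 0], and the specific per-action cap of 1 + Q_D(s, a) followed by renormalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cap_exploration_policy(pi_s, q_d_s):
    """Cap an exploration policy at one state using auxiliary dead-end values.

    pi_s  : action probabilities of any base exploration policy
            (e.g. a uniform random walk) at state s.
    q_d_s : hypothetical auxiliary values Q_D(s, a), assumed to lie in
            [-1, 0]; Q_D(s, a) == -1 marks an action that leads to a
            dead-end with certainty.

    Each action receives at most 1 + Q_D(s, a) of probability mass, and the
    result is renormalized, so actions that certainly lead to dead-ends get
    zero probability.
    """
    caps = np.clip(1.0 + np.asarray(q_d_s, dtype=float), 0.0, 1.0)
    capped = np.minimum(np.asarray(pi_s, dtype=float), caps)
    total = capped.sum()
    if total <= 0.0:
        # Every action looks like a certain dead-end; fall back to uniform.
        return np.full_like(capped, 1.0 / len(capped))
    return capped / total

# Capping a uniform base policy, in the spirit of a secure random-walk.
pi_uniform = np.full(4, 0.25)
q_d = np.array([0.0, -1.0, -0.4, 0.0])  # hypothetical learned values
print(cap_exploration_policy(pi_uniform, q_d))
```

In this sketch, capping a uniform base policy yields a secure random-walk in the sense described above; any other exploration policy could be capped the same way, which matches the abstract's claim that the auxiliary value function can cap "any" given exploration policy.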
