
Reconciling λ-Returns with Experience Replay



Abstract

Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the λ-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of λ-returns. Yet replay-based methods are often the most sample efficient, and incorporating λ-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of λ-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent λ-returns can be efficiently precomputed by sharing Q-values. Computation is not wasted on experiences that are never sampled, and stored λ-returns behave as stable temporal-difference (TD) targets that replace the target network. Additionally, our method grants the unique ability to observe TD errors prior to sampling; for the first time, transitions can be prioritized by their true significance rather than by a proxy to it. Furthermore, we propose the novel use of the TD error to dynamically select λ-values that facilitate faster learning. We show that these innovations can enhance the performance of DQN when playing Atari 2600 games, even under partial observability. While our work specifically focuses on λ-returns, these ideas are applicable to any multi-step return estimator.
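The caching idea described in the abstract can be made concrete with a small sketch. Assuming the standard recursive form of the λ-return with Q-learning bootstrapping, R_t^λ = r_t + γ[(1 − λ)·max_a Q(s_{t+1}, a) + λ·R_{t+1}^λ], all λ-returns in a cached block of consecutive transitions can be filled in by a single backward pass that reuses each Q-value for its neighbours. The Python sketch below illustrates this under that assumption; the function name precompute_lambda_returns, the array layout, and the block-boundary handling are illustrative placeholders, not the paper's implementation.

import numpy as np

def precompute_lambda_returns(rewards, dones, q_next, lam, gamma):
    # rewards[t]: reward r_t for transition t in the cached block.
    # dones[t]:   True if transition t ended the episode.
    # q_next[t]:  max_a Q(s_{t+1}, a), obtained from a single batched
    #             forward pass of the online network over the block.
    # Returns one lambda-return per transition, computed backwards with
    #   R_t = r_t + gamma * ((1 - lam) * q_next[t] + lam * R_{t+1}),
    # so each Q-value is shared by every return that precedes it.
    T = len(rewards)
    returns = np.zeros(T)
    next_return = q_next[-1]  # fall back to the one-step target at the block edge
    for t in reversed(range(T)):
        if dones[t]:
            returns[t] = rewards[t]  # no bootstrapping past episode end
        else:
            returns[t] = rewards[t] + gamma * (
                (1.0 - lam) * q_next[t] + lam * next_return
            )
        next_return = returns[t]
    return returns

# Hypothetical usage when refreshing the cache:
rewards = np.array([0.0, 1.0, 0.0, 0.0])
dones   = np.array([False, False, False, True])
q_next  = np.array([2.0, 1.5, 1.8, 0.0])  # from one Q-network forward pass
targets = precompute_lambda_returns(rewards, dones, q_next, lam=0.9, gamma=0.99)

Such cached targets can then act as the stable TD targets the abstract describes, and the corresponding TD errors (target minus current Q-value) are available before sampling, which is what enables the exact prioritization and TD-error-based λ-selection mentioned above.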
