
Value-based Algorithms Optimization with Discounted Multiple-step Learning Method in Deep Reinforcement Learning


Abstract

Value-based algorithms have been demonstrated on a range of deep reinforcement learning tasks. However, they suffer from problems of stability, overestimation, and convergence, which severely limit their application in real-world environments. In n-step learning, the truncated N-step return is used as part of a multiple-step target to speed up learning and mitigate these defects, but such algorithms are still far from practical application. In this paper, we propose a straightforward optimization method, the Discounted Multiple-step Learning Method (DMLM), which improves the performance of value-based algorithms by applying a discount factor to the truncated N-step return and shows better results in our experiments. In this method, the discounted truncated N-step return, rather than the accumulated discounted reward, forms the main part of the target when computing the TD-error used as the loss function of the evaluation network. In the experimental part, we compare against value-based algorithms without this method and show that DMLM yields more accurate predictions of the value function, thereby outperforming other optimization methods in terms of stability, overestimation, and convergence.
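As a rough illustration of the target described in the abstract, the Python sketch below computes a discounted truncated N-step target. The exact placement of the extra discount (the parameter `lam` here) is an assumption, since the abstract only states that the truncated N-step return is discounted before being combined with the bootstrap value from the target network.

```python
def dmlm_target(rewards, q_bootstrap, gamma=0.99, lam=0.9):
    """Sketch of a discounted truncated n-step target.

    rewards     : the n rewards r_t, ..., r_{t+n-1} along the sampled segment
    q_bootstrap : max_a Q_target(s_{t+n}, a) from the target network
    gamma       : the usual per-step discount factor
    lam         : extra discount applied to the truncated n-step return
                  (hypothetical placement; the abstract only says this
                  return is discounted)
    """
    n = len(rewards)
    # Truncated n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k}
    truncated_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # DMLM-style target: discount the truncated return, then bootstrap as usual
    return lam * truncated_return + (gamma ** n) * q_bootstrap


# Toy usage: a 3-step reward segment with a bootstrap value of 1.5
print(dmlm_target([1.0, 0.0, 0.5], q_bootstrap=1.5))
```

This target would replace the single-step TD target when computing the loss of the evaluation network; with `lam = 1.0` it reduces to the ordinary truncated n-step target.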
