
Robust Reinforcement Learning with a Stochastic Value Function



Abstract

The field of reinforcement learning has been significantly advanced by the application of deep learning. The Deep Deterministic Policy Gradient (DDPG), an actor-critic method for continuous control, can derive satisfactory policies using a deep neural network. However, in common with other deep neural networks, the DDPG requires a large number of training samples and careful hyperparameter tuning. In this paper, we propose a Stochastic Value Function (SVF) that treats a value function such as the Q function as a stochastic variable that can be sampled from N(μ_Q, σ_Q). To learn appropriate value functions, we use Bayesian regression with KL divergence in place of simple regression with squared errors. We demonstrate that the technique used in Trust Region Policy Optimization (TRPO) can provide efficient learning. We implemented DDPG with SVF (DDPG-SVF) and confirmed (1) that DDPG-SVF converged well, with high sampling efficiency, (2) that DDPG-SVF obtained good results while requiring less hyperparameter tuning, and (3) that the TRPO technique offers an effective way of addressing the hyperparameter tuning problem.
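The abstract describes the SVF idea only at a high level. The following is a minimal, hypothetical sketch (in PyTorch) of a critic that outputs the parameters of N(μ_Q, σ_Q) and is trained with a KL-divergence loss rather than squared-error regression. The network architecture, the form of the target distribution, the KL direction, and names such as StochasticCritic and critic_loss are illustrative assumptions, not the paper's exact implementation.

# Hypothetical sketch of a stochastic Q-value critic: the critic outputs the
# parameters (mu_Q, sigma_Q) of a Normal distribution over Q-values and is trained
# by minimizing a KL divergence to an assumed target distribution instead of a
# squared-error Bellman regression. All sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class StochasticCritic(nn.Module):
    """Critic mapping (state, action) to a distribution N(mu_Q, sigma_Q)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, 1)         # mean of the Q distribution
        self.log_sigma_head = nn.Linear(hidden, 1)  # log std, made positive via exp

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> Normal:
        h = self.body(torch.cat([state, action], dim=-1))
        mu = self.mu_head(h)
        sigma = self.log_sigma_head(h).exp().clamp(min=1e-3)
        return Normal(mu, sigma)

def critic_loss(critic: StochasticCritic,
                target_critic: StochasticCritic,
                target_actor: nn.Module,
                batch: dict, gamma: float = 0.99) -> torch.Tensor:
    """KL-divergence critic loss replacing the usual squared Bellman error."""
    with torch.no_grad():
        next_action = target_actor(batch["next_state"])
        next_dist = target_critic(batch["next_state"], next_action)
        # Bellman backup applied to the distribution parameters (assumed form).
        target_mu = batch["reward"] + gamma * (1.0 - batch["done"]) * next_dist.mean
        target_sigma = gamma * next_dist.stddev + 1e-3
        target_dist = Normal(target_mu, target_sigma)
    pred_dist = critic(batch["state"], batch["action"])
    # KL(target || prediction); the direction is an assumption for this sketch.
    return kl_divergence(target_dist, pred_dist).mean()

In a DDPG-style training loop, the actor could then be updated against the mean of the predicted distribution (pred_dist.mean) in place of the usual deterministic Q estimate; this is one plausible reading of the abstract, not a confirmed detail of DDPG-SVF.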
