The field of reinforcement learning has been significantly advanced by the application of deep learning. Deep Deterministic Policy Gradient (DDPG), an actor-critic method for continuous control, can learn satisfactory policies using a deep neural network. However, in common with other deep neural networks, DDPG requires a large number of training samples and careful hyperparameter tuning. In this paper, we propose a Stochastic Value Function (SVF) that treats a value function such as the Q function as a random variable sampled from N(μ_Q, σ_Q). To learn appropriate value functions, we use Bayesian regression with KL divergence in place of simple regression with squared errors. We demonstrate that the technique used in Trust Region Policy Optimization (TRPO) can provide efficient learning. We implemented DDPG with SVF (DDPG-SVF) and confirmed (1) that DDPG-SVF converged well, with high sampling efficiency, (2) that DDPG-SVF obtained good results while requiring less hyperparameter tuning, and (3) that the TRPO technique offers an effective way of addressing the hyperparameter tuning problem.
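To make the contrast with squared-error regression concrete, consider a minimal sketch; the exact objective is defined in the body of the paper, and the Gaussian target N(μ_T, σ_T) and the direction of the divergence here are illustrative assumptions. If the regression target is likewise modeled as a Gaussian, the KL divergence between target and prediction has the closed form

\[
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_T,\sigma_T^2)\,\|\,\mathcal{N}(\mu_Q,\sigma_Q^2)\bigr)
= \log\frac{\sigma_Q}{\sigma_T} + \frac{\sigma_T^2 + (\mu_T-\mu_Q)^2}{2\sigma_Q^2} - \frac{1}{2},
\]

which, when both variances are fixed and equal (σ_T = σ_Q = σ), reduces to (μ_T − μ_Q)² / (2σ²), i.e., the ordinary squared-error objective up to scale. The Bayesian treatment differs precisely in that σ_Q is learned, so the critic can also express its own uncertainty.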