JMLR: Workshop and Conference Proceedings

The Uncertainty Bellman Equation and Exploration



Abstract

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
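
As a rough illustration of the idea described in the abstract (not the paper's exact algorithm), the sketch below iterates a UBE-style recursion u <- nu + gamma^2 * E[u'] to its fixed point in a tabular MDP, then perturbs the Q-values by a sampled multiple of sqrt(u) when selecting actions, in place of epsilon-greedy. The function names, the gamma^2 propagation constant, and the local-uncertainty term nu are illustrative assumptions.

    import numpy as np

    def solve_ube(P, pi, nu, gamma=0.99, iters=1000, tol=1e-8):
        """Iterate a UBE-style recursion u <- nu + gamma^2 * E_{s',a'}[u] to its fixed point.

        P[s, a, s'] are transition probabilities, pi[s, a] is the policy, and
        nu[s, a] is a local uncertainty estimate (e.g. shrinking with visit counts).
        The fixed point u is meant to upper-bound the posterior variance of the
        Q-values, in the spirit of the abstract; the exact constants are assumptions.
        """
        S, A = nu.shape
        u = np.zeros((S, A))
        for _ in range(iters):
            # Expected next-step uncertainty: sum_{s'} P(s'|s,a) * sum_{a'} pi(a'|s') u(s',a')
            next_u = P.reshape(S * A, S) @ (pi * u).sum(axis=1)
            u_new = nu + (gamma ** 2) * next_u.reshape(S, A)
            if np.max(np.abs(u_new - u)) < tol:
                return u_new
            u = u_new
        return u

    def ube_action(Q, u, s, beta=1.0, rng=None):
        """In place of epsilon-greedy: perturb Q by beta * zeta * sqrt(u), zeta ~ N(0, 1)."""
        rng = np.random.default_rng() if rng is None else rng
        zeta = rng.standard_normal(Q.shape[1])
        return int(np.argmax(Q[s] + beta * zeta * np.sqrt(u[s])))

In the large-scale setting the abstract targets, u would presumably be produced by a learned uncertainty estimate rather than exact tabular iteration; the tabular form above only makes the fixed-point structure explicit.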
