Conference on Neural Information Processing Systems

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies



Abstract

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state, finite-horizon, undiscounted MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, O(√(HSAT)). Thus, full planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full planning to act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
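To make the distinction concrete, below is a minimal NumPy sketch (not taken from the paper) contrasting full planning with 1-step greedy planning on an empirical finite-horizon MDP. All names (P_hat, R_hat, Q, env) are illustrative assumptions: P_hat[h] and R_hat[h] stand for empirical transition and reward estimates of shape (S, A, S) and (S, A), and the sketch omits the exploration machinery used by the algorithms analyzed in the paper.

```python
import numpy as np

def full_planning(P_hat, R_hat, H, S, A):
    """Full backward induction over the empirical MDP: O(H * S^2 * A) work per solve."""
    Q = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        V_next = Q[h + 1].max(axis=1)        # next-step value at every state
        Q[h] = R_hat[h] + P_hat[h] @ V_next  # Bellman backup at all S states
    return Q

def greedy_episode(env, P_hat, R_hat, Q, H):
    """1-step planning: back up only the visited state, O(H * S * A) per episode,
    i.e., roughly a factor of S cheaper than re-solving the whole empirical MDP."""
    s = env.reset()                                   # hypothetical environment interface
    for h in range(H):
        V_next = Q[h + 1].max(axis=1)                 # next-step values, O(S * A)
        Q[h, s] = R_hat[h, s] + P_hat[h, s] @ V_next  # backup at the current state only
        a = int(np.argmax(Q[h, s]))                   # act greedily w.r.t. the updated Q
        s = env.step(a)
    return Q
```

In this reading, the greedy (1-step planning) agent keeps the same Q table across episodes and updates it only along the visited trajectory, which is the computational saving the abstract refers to.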
