JMLR: Workshop and Conference Proceedings

Root-n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank



Abstract

In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed by Jiang et al. (2017), known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods of Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first root-n-regret result for reinforcement learning in stochastic MDPs with general value function approximation.
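To illustrate the elimination idea the abstract builds on, below is a minimal Python sketch of an OLIVE-style test that discards candidate value functions whose estimated average Bellman error on the collected data is too large. The interface (a callable `f(state, action)`, transition tuples, and the helper names `avg_bellman_error` and `eliminate`) is hypothetical and chosen only for illustration; it omits the contextual-bandit reformulation, active elimination, and expert weighting that distinguish AVE from random-action exploration.

```python
import numpy as np

def avg_bellman_error(f, trajectories, actions, gamma=1.0):
    """Estimate the average Bellman error of a candidate action-value
    function f on data gathered by the current exploration policy.
    f(state, action) -> float; each trajectory is a list of
    (state, action, reward, next_state) transitions.
    (Hypothetical interface, for illustration only.)"""
    residuals = []
    for traj in trajectories:
        for s, a, r, s_next in traj:
            # One-step Bellman residual under the greedy backup induced by f.
            v_next = max(f(s_next, b) for b in actions)
            residuals.append(f(s, a) - (r + gamma * v_next))
    return float(np.mean(residuals))

def eliminate(candidates, trajectories, actions, tol):
    """OLIVE-style elimination step: keep only the candidate value
    functions whose estimated average Bellman error is within tol."""
    return [f for f in candidates
            if abs(avg_bellman_error(f, trajectories, actions)) <= tol]
```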
