JMLR: Workshop and Conference Proceedings

Root-n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank



Abstract

In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed by Jiang et al. (2017), known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods of Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first root-n-regret result for reinforcement learning in stochastic MDPs with general value function approximation.
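To illustrate the elimination idea the abstract builds on, below is a minimal Python sketch of an OLIVE-style test that discards candidate value functions whose estimated average Bellman error on the collected data is too large. The interface (a callable `f(state, action)`, transition tuples, and the helper names `avg_bellman_error` and `eliminate`) is hypothetical and chosen only for illustration; it omits the contextual-bandit reformulation, active elimination, and expert weighting that distinguish AVE from random-action exploration.

```python
import numpy as np

def avg_bellman_error(f, trajectories, actions, gamma=1.0):
    """Estimate the average Bellman error of a candidate action-value
    function f on data gathered by the current exploration policy.
    f(state, action) -> float; each trajectory is a list of
    (state, action, reward, next_state) transitions.
    (Hypothetical interface, for illustration only.)"""
    residuals = []
    for traj in trajectories:
        for s, a, r, s_next in traj:
            # One-step Bellman residual under the greedy backup induced by f.
            v_next = max(f(s_next, b) for b in actions)
            residuals.append(f(s, a) - (r + gamma * v_next))
    return float(np.mean(residuals))

def eliminate(candidates, trajectories, actions, tol):
    """OLIVE-style elimination step: keep only the candidate value
    functions whose estimated average Bellman error is within tol."""
    return [f for f in candidates
            if abs(avg_bellman_error(f, trajectories, actions)) <= tol]
```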
