JMLR: Workshop and Conference Proceedings

Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs



Abstract

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state and without any reset. We revisit the minimax lower bound for that problem, making the local variance of the bias function appear in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the \texttt{\textsc{KL-Ucrl}} algorithm, establishing a high-probability regret bound for ergodic MDPs that scales as $\widetilde{\mathcal{O}}\bigl(\sqrt{S\sum_{s,a}\mathbf{V}^\star_{s,a}\,T}\bigr)$, where $S$ denotes the number of states and $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of the MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs, indicating that the derived bound can provide an order-of-magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, which we believe to be of independent interest.
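For intuition about how the leading terms of the two bounds compare, the sketch below computes the local variances $\mathbf{V}^\star_{s,a}$ of the bias function for a small ergodic MDP and plugs them into $\sqrt{S\sum_{s,a}\mathbf{V}^\star_{s,a}\,T}$ and $DS\sqrt{AT}$. It is a minimal illustration under stated assumptions: the 2-state MDP, its rewards, the diameter value D, and the horizon T are hypothetical (not the paper's benchmark MDPs), the bias is approximated by relative value iteration, and all logarithmic factors are ignored.

import numpy as np

# Hypothetical 2-state, 2-action ergodic MDP (illustrative, not from the paper).
# P[s, a] is the next-state distribution after action a in state s; R[s, a] is the mean reward.
S, A = 2, 2
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
])
R = np.array([[0.0, 0.4],
              [1.0, 0.5]])

# Relative value iteration: approximate the optimal bias function h* under the
# average-reward criterion, normalised so that h*(state 0) = 0.
h = np.zeros(S)
for _ in range(100000):
    Q = R + P @ h              # Q[s, a] = r(s, a) + sum_s' p(s' | s, a) h(s')
    h_new = Q.max(axis=1)
    h_new -= h_new[0]          # relative values keep the iterates bounded
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

# Local variance of the bias: V*_{s,a} = Var_{s' ~ p(. | s, a)}[h*(s')].
mean_h = P @ h
V = P @ h**2 - mean_h**2       # shape (S, A)

# Leading terms of the two regret bounds (log factors ignored). The diameter D
# is set to an illustrative value here rather than computed exactly.
D, T = 2.0, 1e6
variance_aware = np.sqrt(S * V.sum() * T)   # sqrt(S * sum_{s,a} V*_{s,a} * T)
diameter_based = D * S * np.sqrt(A * T)     # D * S * sqrt(A * T)
print(f"variance-aware leading term: {variance_aware:.1f}")
print(f"diameter-based leading term: {diameter_based:.1f}")

On MDPs whose bias function varies little from state to state, the summed local variances stay small and the variance-aware term can be far below the diameter-based one, which is the kind of gap the paper's benchmark comparison highlights.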

