Journal of Machine Learning Research

Off-policy Learning With Eligibility Traces: A Survey

Abstract

In the framework of Markov Decision Processes, we consider linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms (off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ)) and suggests new extensions (off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ)). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties, and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms, on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) if the feature space dimension is too large for a least-squares approach, perform the best.
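To make the setting concrete, the core idea the survey builds on can be sketched as a single-pass off-policy TD(λ) update with importance-sampling-corrected eligibility traces. This is a minimal illustrative sketch, not the survey's exact derivation: the function name, array layout, and the particular trace variant (importance ratio applied to the whole accumulating trace) are assumptions chosen for brevity.

```python
import numpy as np

def off_policy_td_lambda(features, rewards, rho, alpha=0.01, lam=0.9, gamma=0.95):
    """Minimal sketch of off-policy TD(lambda) with linear function
    approximation and importance-weighted eligibility traces.

    features: (T+1, d) array of feature vectors phi(s_0), ..., phi(s_T)
    rewards:  (T,) rewards r_1, ..., r_T observed along one trajectory
    rho:      (T,) per-step importance ratios pi(a_t|s_t) / mu(a_t|s_t),
              correcting for the trajectory being generated by a
              behavior policy mu rather than the target policy pi
    """
    d = features.shape[1]
    theta = np.zeros(d)   # weights of the linear value estimate V(s) = phi(s) @ theta
    z = np.zeros(d)       # eligibility trace vector
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # one-step TD error of the current linear value estimate
        delta = rewards[t] + gamma * phi_next @ theta - phi @ theta
        # accumulating trace, reweighted by the importance ratio
        z = rho[t] * (gamma * lam * z + phi)
        # stochastic-gradient-style update along the trace
        theta += alpha * delta * z
    return theta
```

Setting every ratio in `rho` to 1 recovers ordinary on-policy TD(λ); λ = 0 recovers one-step TD. The least-squares algorithms the survey compares (LSTD(λ), LSPE(λ)) replace this stochastic update with recursively maintained second-order statistics, which is why they dominate when the feature dimension d is small enough to afford them.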
