
Incremental least-squares temporal difference learning.

Abstract

Sequential decision making is a challenging problem for the artificial intelligence community. It can be modeled as an agent interacting with an environment according to its policy. Policy iteration methods are a popular approach that interleaves two stages: policy evaluation, which computes the desirability of each state with respect to the current policy, and policy improvement, which improves the policy with respect to the computed state values. The effectiveness of this approach depends heavily on the effectiveness of policy evaluation, which is the focus of this dissertation. The per-time-step complexity of traditional methods such as temporal difference learning (TD) is sublinear in the number of features. They therefore scale to large environments; however, they use training data relatively inefficiently and so require a large number of sample interactions. The least-squares TD (LSTD) method addresses the data inefficiency of TD by using the sum of the TD updates over all past experience. This makes LSTD a formidable algorithm for problems where data is limited or expensive to gather. However, the computational cost of LSTD cripples its applicability in most large environments. We introduce an incremental version of the LSTD method, called iLSTD, for online policy evaluation in large problems.

On each time step, iLSTD uses the sum TD update vector in a gradient fashion, selecting and descending along a limited set of dimensions. We show that if a sparse feature representation is used, the per-time-step complexity of iLSTD is linear in the number of features, whereas that of LSTD is quadratic. This allows iLSTD to scale up to large environments with many features, where LSTD cannot be applied. On the other hand, because iLSTD takes advantage of all of the data on each time step, it requires far less data than the TD method.
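A single iLSTD time step as described above can be sketched as follows. This is a minimal dense sketch, not the thesis's exact algorithm: the function name, the dense NumPy arrays, and the parameter m (the number of dimensions descended per step) are illustrative assumptions. The thesis's linear per-step cost relies on exploiting sparse feature vectors; the dense version below costs quadratic time and is shown only to make the update explicit.

```python
import numpy as np

def ilstd_step(A, b, mu, theta, phi, phi_next, reward, gamma, alpha, m):
    """One iLSTD time step (dense illustrative sketch).

    A, b accumulate the LSTD statistics; mu = b - A @ theta is the
    sum TD update vector, maintained incrementally.
    """
    # Accumulate the LSTD statistics from the new transition.
    dA = np.outer(phi, phi - gamma * phi_next)
    A += dA
    b += reward * phi
    # Incrementally update mu = b - A @ theta.
    mu += reward * phi - dA @ theta
    # Gradient-style descent along the m dimensions with largest |mu|.
    for _ in range(m):
        j = int(np.argmax(np.abs(mu)))
        step = alpha * mu[j]
        theta[j] += step
        mu -= step * A[:, j]   # keep mu consistent with the new theta
    return A, b, mu, theta
```

With sparse feature vectors, dA touches only a few rows and columns, so maintaining A, b, and mu (and descending a bounded number of dimensions) can be done in time linear in the number of features, which is the point of the algorithm.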
Empirical results in the Boyan chain and mountain car environments show the superiority of iLSTD with respect to TD and its speed advantage with respect to LSTD. We also extend iLSTD with eligibility traces, resulting in iLSTD(lambda), and show that the additional computation does not change the linear per-time-step complexity. Additionally, we investigate the performance and convergence properties of iLSTD under different dimension-selection mechanisms. Finally, we discuss the limitations of this study.
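The eligibility-trace extension mentioned above changes only how the statistics are accumulated. A hedged sketch, assuming the standard accumulating-trace form of the LSTD(lambda) statistics; the helper name and signature are illustrative, not taken from the thesis.

```python
import numpy as np

def ilstd_lambda_stats(z, phi, phi_next, reward, gamma, lam):
    """Trace-weighted updates to the A and b statistics.

    z is the eligibility trace vector. The returned dA and db replace
    the one-step updates in the iLSTD loop; with sparse features the
    per-step cost remains linear in the number of features.
    """
    z = gamma * lam * z + phi                  # decay and accumulate the trace
    dA = np.outer(z, phi - gamma * phi_next)   # trace-weighted outer product
    db = reward * z                            # trace-weighted reward term
    return z, dA, db
```

Setting lam = 0 recovers the one-step updates, so iLSTD(lambda) strictly generalizes the basic algorithm.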

Bibliographic information

  • Author: Geramifard, Alborz
  • Institution: University of Alberta (Canada)
  • Degree-granting institution: University of Alberta (Canada)
  • Subject: Computer science
  • Degree: M.Sc.
  • Year: 2007
  • Pagination: 63 p.
  • Total pages: 63
  • Format: PDF
  • Language: eng
