Knowledge-Based Systems

Online fitted policy iteration based on extreme learning machines



Abstract

Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains limited for several reasons. Particularly important among these are the large amount of data the agent needs to learn useful policies and the poor scalability to high-dimensional problems caused by the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that makes progress on both fronts. OFPI is based on a semi-batch scheme that increases convergence speed by reusing data and enables the use of global approximators by reformulating value function approximation as a standard supervised learning problem. The proposed method has been empirically evaluated on three benchmark problems. During the experiments, OFPI employed a neural network trained with the extreme learning machine algorithm to approximate the value functions. The results demonstrate the stability of OFPI with a global function approximator, as well as performance improvements over two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network. (C) 2016 Elsevier B.V. All rights reserved.
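The abstract describes OFPI as recasting value-function approximation as a supervised regression problem solved with an extreme learning machine (ELM). The following is a minimal sketch of that general idea, not the paper's actual OFPI algorithm: a one-hidden-layer ELM regressor (random, fixed hidden weights; ridge-regression output weights) used for one fitted Q-function sweep over a stored batch of transitions. The class and function names, the one-hot action encoding, and the transition format are illustrative assumptions.

```python
import numpy as np


class ELMRegressor:
    """Extreme learning machine: random, fixed hidden layer; only the
    linear output weights are fitted (closed-form ridge regression)."""

    def __init__(self, n_inputs, n_hidden=100, reg=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_inputs, n_hidden))  # random input weights (never trained)
        self.b = rng.normal(size=n_hidden)               # random hidden biases
        self.reg = reg
        self.beta = np.zeros(n_hidden)                   # trained output weights

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)              # hidden-layer features

    def fit(self, X, y):
        H = self._hidden(X)
        # Ridge solution: beta = (H^T H + reg * I)^-1 H^T y
        A = H.T @ H + self.reg * np.eye(H.shape[1])
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


def fitted_q_sweep(q_func, transitions, n_actions, gamma=0.99):
    """One semi-batch sweep: reuse the stored transitions to refit Q(s, a)
    as an ordinary supervised regression problem.

    `transitions` is a list of (state, action, reward, next_state, done)
    tuples with 1-D numpy-array states; actions are one-hot encoded and
    concatenated with the state (an illustrative choice, not the paper's).
    """
    def feats(s, a):
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        return np.concatenate([s, one_hot])

    X, y = [], []
    for s, a, r, s2, done in transitions:
        target = r
        if not done and q_func is not None:
            # Bootstrapped regression target: r + gamma * max_a' Q(s', a').
            target += gamma * max(
                q_func.predict(feats(s2, a2)[None, :])[0] for a2 in range(n_actions)
            )
        X.append(feats(s, a))
        y.append(target)

    X, y = np.asarray(X), np.asarray(y)
    return ELMRegressor(n_inputs=X.shape[1]).fit(X, y)
```

In an OFPI-style loop one would, under these assumptions, repeatedly collect transitions with the greedy policy implied by the current Q estimate and call fitted_q_sweep again on the growing batch, which is where the data reuse described in the abstract comes from.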
