International Journal of Applied Mathematics and Computer Science

An Active Exploration Method for Data Efficient Reinforcement Learning



Abstract

Reinforcement learning (RL) is an effective method for controlling dynamic systems without prior knowledge. One of the most important and difficult problems in RL is improving data efficiency. Probabilistic inference for learning control (PILCO) is a state-of-the-art data-efficient framework that uses a Gaussian process to model the system dynamics. However, it focuses only on optimizing cumulative rewards and does not consider the accuracy of the dynamics model, which is an important factor for controller learning. To further improve the data efficiency of PILCO, we propose an active exploration version (AEPILCO) that uses information entropy to describe samples. In the policy evaluation stage, we incorporate an information entropy criterion into long-term sample prediction. Through this informative policy evaluation function, our algorithm obtains informative policy parameters in the policy improvement stage. Executing the resulting policy produces an informative sample set, which helps in learning an accurate dynamics model. Thus, the AEPILCO algorithm improves data efficiency by actively selecting informative samples according to the information entropy criterion and thereby learning an accurate dynamics model. We demonstrate the validity and efficiency of the proposed algorithm on several challenging control problems: a cart pole, a pendubot, a double pendulum, and a cart double pendulum. AEPILCO learns a controller in fewer trials than PILCO, as verified by theoretical analysis and experimental results.
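The entropy-augmented policy evaluation described above can be illustrated with a minimal sketch. The paper itself does not give this code; the function names, the trade-off weight `lam`, and the exact way the entropy bonus enters the objective are assumptions for illustration. The sketch uses the standard closed-form differential entropy of a Gaussian predictive distribution, which is what a GP dynamics model produces at each step of a long-term rollout:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian N(m, cov):
    H = 0.5 * log((2*pi*e)^d * det(cov)).  Larger predictive
    covariance (a less certain dynamics model) gives larger entropy."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)  # numerically stable log-determinant
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def informative_return(expected_costs, pred_covs, lam=0.1):
    """Hypothetical informative policy evaluation: the usual cumulative
    expected cost of a long-term GP prediction, minus lam times the summed
    predictive entropy.  Minimizing this objective prefers policies that
    also visit states where the dynamics model is uncertain, so the
    executed rollout yields informative samples."""
    cost = sum(expected_costs)
    info = sum(gaussian_entropy(S) for S in pred_covs)
    return cost - lam * info
```

With `lam = 0` this reduces to PILCO's plain cumulative-cost objective; increasing `lam` shifts the policy toward exploration of poorly modeled regions of the state space.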

