IEEE Transactions on Robotics

Dual REPS: A Generalization of Relative Entropy Policy Search Exploiting Bad Experiences


Abstract

Policy search (PS) algorithms are widely used for their simplicity and effectiveness in finding solutions for robotic problems. However, most current PS algorithms derive policies by statistically fitting the data from the best experiments only. This means that experiments yielding a poor performance are usually discarded or given too little influence on the policy update. In this paper, we propose a generalization of the relative entropy policy search (REPS) algorithm that takes bad experiences into consideration when computing a policy. The proposed approach, named dual REPS (DREPS) following the philosophical interpretation of the duality between good and bad, finds clusters of experimental data yielding a poor behavior and adds them to the optimization problem as a repulsive constraint. Thus, considering that there is a duality between good and bad data samples, both are taken into account in the stochastic search for a policy. Additionally, a cluster with the best samples may be included as an attractor to enforce faster convergence to a single optimal solution in multimodal problems. We first tested our proposed approach in a simulated reinforcement learning setting and found that DREPS considerably speeds up the learning process, especially during the early optimization steps and in cases where other approaches get trapped in between several alternative maxima. Further experiments in which a real robot had to learn a task with a multimodal reward function confirm the advantages of our proposed approach with respect to REPS.
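The abstract describes DREPS as a generalization of episodic REPS, so a compact illustration of the base update it builds on may help. The sketch below is a toy, assumption-laden example: it implements the standard REPS dual for the temperature and a weighted Gaussian refit using NumPy/SciPy; the function names, the KL bound `epsilon`, the quadratic toy reward, and the comment on how bad-experience clusters would enter are illustrative and not the authors' exact formulation.

```python
# A minimal, illustrative sketch of the episodic REPS update that DREPS
# generalizes. Assumptions: NumPy/SciPy only, Gaussian search distribution,
# toy quadratic reward; the repulsive handling of bad-experience clusters
# described in the abstract is only indicated in a comment, not implemented.
import numpy as np
from scipy.optimize import minimize


def reps_weights(returns, epsilon=0.5):
    """Solve the REPS dual for the temperature eta and return sample weights.

    epsilon is the KL bound limiting how far the updated search distribution
    may move from the one that generated the samples.
    """
    R = returns - np.max(returns)  # shift returns for numerical stability

    def dual(log_eta):
        eta = np.exp(log_eta).item()  # optimize log(eta) so eta stays positive
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    res = minimize(dual, x0=np.array([0.0]), method="Nelder-Mead")
    eta = np.exp(res.x[0])
    w = np.exp(R / eta)
    return w / np.sum(w)


def weighted_gaussian_fit(params, weights):
    """Weighted maximum-likelihood refit of the Gaussian search distribution."""
    mean = weights @ params
    diff = params - mean
    cov = (weights[:, None] * diff).T @ diff
    return mean, cov + 1e-6 * np.eye(params.shape[1])  # jitter keeps cov PSD


# Toy usage: one REPS iteration over 100 sampled 2-D policy parameter vectors.
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=(100, 2))
returns = -np.sum((theta - np.array([1.0, -0.5])) ** 2, axis=1)  # toy reward

w = reps_weights(returns, epsilon=0.5)
# DREPS (per the abstract) would additionally cluster the worst-performing
# samples and add each cluster to the optimization as a repulsive constraint,
# optionally with the best cluster acting as an attractor; omitted here.
mean, cov = weighted_gaussian_fit(theta, w)
print("new mean:", mean)
print("new covariance:\n", cov)
```

In this toy run the weighted refit pulls the search distribution toward the high-return region; DREPS, as summarized above, additionally shapes that update using clusters of poorly performing samples.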
