
An exploratory rollout policy for imagination-augmented agents

Abstract

Typical reinforcement learning methods lack planning and thus require large amounts of training data to achieve the expected performance. Imagination-Augmented Agents (I2A), a model-based approach, learn to extract information from imagined trajectories to construct implicit plans, and show improved data efficiency and performance. However, in I2A these imagined trajectories are generated by a shared rollout policy, which makes them look similar and carry little information. We propose an exploratory rollout policy named E-I2A. When the agent's performance is poor, E-I2A produces diverse imagined trajectories that are more informative. As the agent's performance improves with training, the trajectories generated by E-I2A become consistent with the agent's trajectories in the real environment and yield high rewards. To achieve this, we first quantify the novelty of a state by training an inverse dynamics model; the agent then picks the states with the highest novelty to generate diverse trajectories. Simultaneously, we train a distilled value-function model to estimate the expected return of a state, which lets us imagine the highest-return states and keep the imagined trajectories consistent with the real ones. Finally, we propose an adaptive method that shifts the imagined trajectories from initially diverse to eventually consistent, further improving the agent's performance. By offering more information at decision time, our method demonstrates improved performance and data efficiency. We evaluated E-I2A on several challenging domains, including MiniPacman and Sokoban, where it outperforms several baselines.
