JMLR: Workshop and Conference Proceedings

Stable Policy Optimization via Off-Policy Divergence Regularization

Abstract

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
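
The sketch below is not the authors' implementation; it is a hypothetical illustration of the kind of objective the abstract describes: a clipped PPO-style surrogate augmented with a proximity term on the state-action visitation distributions of consecutive policies, where that term is estimated by an adversarially trained discriminator. All names (GaussianPolicy, VisitationDiscriminator, regularized_policy_loss, beta) and the importance-weighting shortcut are assumptions made for illustration only.

```python
# Hypothetical sketch (PyTorch): PPO surrogate + adversarially estimated
# visitation-divergence penalty. Not the paper's algorithm; for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy over continuous actions (illustrative)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())


class VisitationDiscriminator(nn.Module):
    """Scores (s, a) pairs; trained to separate samples from the previous policy's
    visitation distribution from samples attributed to the current policy."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_loss(disc, old_obs, old_act, new_obs, new_act):
    # GAN-style classification loss: high logits on old-policy samples, low logits
    # on new-policy samples. At its optimum this yields a (Jensen-Shannon style)
    # divergence estimate between the two visitation distributions.
    old_logits = disc(old_obs, old_act)
    new_logits = disc(new_obs, new_act)
    return (F.binary_cross_entropy_with_logits(old_logits, torch.ones_like(old_logits))
            + F.binary_cross_entropy_with_logits(new_logits, torch.zeros_like(new_logits)))


def regularized_policy_loss(policy, obs, act, adv, old_log_probs, disc,
                            beta=1.0, clip=0.2):
    # Clipped PPO surrogate on advantages `adv` collected under the old policy.
    log_probs = policy.dist(obs).log_prob(act).sum(-1)
    ratio = (log_probs - old_log_probs).exp()
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv).mean()

    # Proximity term: a non-saturating "generator" loss that is small when the
    # discriminator scores the new policy's (s, a) samples as old-policy-like,
    # i.e. when the induced visitation distributions are close. The batch here
    # comes from old-policy rollouts; weighting by the importance ratio is a
    # crude one-step proxy for sampling from the new policy's visitation
    # (the paper learns this term off-policy instead).
    proximity = -(ratio * F.logsigmoid(disc(obs, act))).mean()

    return -surrogate + beta * proximity
```

In a training loop one would alternate between minimizing discriminator_loss on samples from the previous and current policies and minimizing regularized_policy_loss over the policy parameters; beta trades off the surrogate improvement against staying close to the previous policy's visitation distribution.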
