JMLR: Workshop and Conference Proceedings

Stable Policy Optimization via Off-Policy Divergence Regularization

Abstract

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
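
The sketch below is not the authors' implementation; it is a hypothetical illustration of the kind of objective the abstract describes: a clipped PPO-style surrogate augmented with a proximity term on the state-action visitation distributions of consecutive policies, where that term is estimated by an adversarially trained discriminator. All names (GaussianPolicy, VisitationDiscriminator, regularized_policy_loss, beta) and the importance-weighting shortcut are assumptions made for illustration only.

```python
# Hypothetical sketch (PyTorch): PPO surrogate + adversarially estimated
# visitation-divergence penalty. Not the paper's algorithm; for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy over continuous actions (illustrative)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())


class VisitationDiscriminator(nn.Module):
    """Scores (s, a) pairs; trained to separate samples from the previous policy's
    visitation distribution from samples attributed to the current policy."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_loss(disc, old_obs, old_act, new_obs, new_act):
    # GAN-style classification loss: high logits on old-policy samples, low logits
    # on new-policy samples. At its optimum this yields a (Jensen-Shannon style)
    # divergence estimate between the two visitation distributions.
    old_logits = disc(old_obs, old_act)
    new_logits = disc(new_obs, new_act)
    return (F.binary_cross_entropy_with_logits(old_logits, torch.ones_like(old_logits))
            + F.binary_cross_entropy_with_logits(new_logits, torch.zeros_like(new_logits)))


def regularized_policy_loss(policy, obs, act, adv, old_log_probs, disc,
                            beta=1.0, clip=0.2):
    # Clipped PPO surrogate on advantages `adv` collected under the old policy.
    log_probs = policy.dist(obs).log_prob(act).sum(-1)
    ratio = (log_probs - old_log_probs).exp()
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv).mean()

    # Proximity term: a non-saturating "generator" loss that is small when the
    # discriminator scores the new policy's (s, a) samples as old-policy-like,
    # i.e. when the induced visitation distributions are close. The batch here
    # comes from old-policy rollouts; weighting by the importance ratio is a
    # crude one-step proxy for sampling from the new policy's visitation
    # (the paper learns this term off-policy instead).
    proximity = -(ratio * F.logsigmoid(disc(obs, act))).mean()

    return -surrogate + beta * proximity
```

In a training loop one would alternate between minimizing discriminator_loss on samples from the previous and current policies and minimizing regularized_policy_loss over the policy parameters; beta trades off the surrogate improvement against staying close to the previous policy's visitation distribution.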
