Conference on Neural Information Processing Systems

Constrained Reinforcement Learning Has Zero Duality Gap



Abstract

Autonomous agents must often deal with conflicting requirements, such as completing tasks using the least amount of time/energy, learning multiple tasks, or dealing with multiple opponents. In the context of reinforcement learning (RL), these problems are addressed by (i) designing a reward function that simultaneously describes all requirements or (ii) combining modular value functions that encode them individually. Though effective, these methods have critical downsides. Designing good reward functions that balance different objectives is challenging, especially as the number of objectives grows. Moreover, implicit interference between goals may lead to performance plateaus as they compete for resources, particularly when training on-policy. Similarly, selecting parameters to combine value functions is at least as hard as designing an all-encompassing reward, given that the effect of their values on the overall policy is not straightforward. The latter issue is generally addressed by formulating the conflicting requirements as a constrained RL problem and solving it with primal-dual methods. These algorithms are, in general, not guaranteed to converge to the optimal solution, since the problem is not convex. This work provides theoretical support for these approaches by establishing that, despite its non-convexity, this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. Finally, we show that this result essentially still holds when the policy is described by a rich enough parametrization (e.g., neural networks), connect it with primal-dual algorithms present in the literature, and establish their convergence to the optimal solution.
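
For concreteness, the following is a minimal sketch of the constrained RL formulation and its Lagrangian dual that the abstract refers to. The notation (objective reward r_0, constraint rewards r_i, thresholds c_i, discount factor γ) is illustrative and not taken verbatim from the paper.

```latex
% Constrained RL problem (illustrative notation): maximize the expected
% discounted objective reward subject to expected constraint-reward requirements.
\begin{align}
P^\star = \max_{\pi}\;& V_0(\pi) \triangleq \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_0(s_t,a_t)\right] \\
\text{s.t.}\;\;& V_i(\pi) \triangleq \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t,a_t)\right] \ge c_i, \quad i = 1,\dots,m.
\end{align}
% Lagrangian and dual problem:
\begin{align}
\mathcal{L}(\pi,\lambda) &= V_0(\pi) + \sum_{i=1}^{m} \lambda_i \bigl(V_i(\pi) - c_i\bigr), \\
D^\star &= \min_{\lambda \ge 0}\; \max_{\pi}\; \mathcal{L}(\pi,\lambda).
\end{align}
% The zero duality gap result states D^* = P^*: even though the primal problem
% is non-convex, it can be solved exactly in the (convex) dual domain.
```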
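
The primal-dual algorithms mentioned in the abstract alternate between a policy improvement step on the Lagrangian and a projected subgradient step on the multipliers. Below is a minimal, hypothetical sketch of that alternation on a toy single-state problem (a constrained bandit); the rewards, threshold, and step sizes are invented for illustration and are not from the paper.

```python
# Illustrative primal-dual sketch for a toy constrained problem
# (a single-state "bandit" stand-in for constrained RL).
import numpy as np

r0 = np.array([1.0, 0.2])   # objective reward per action
r1 = np.array([0.0, 1.0])   # constraint reward per action
c = 0.7                     # constraint: E_pi[r1] >= c

theta = np.zeros(2)         # softmax policy parameters (primal variable)
lam = 0.0                   # Lagrange multiplier (dual variable)
eta_theta, eta_lam = 0.5, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    pi = softmax(theta)

    # Primal step: exact policy-gradient ascent on the Lagrangian
    #   L(pi, lam) = E_pi[r0] + lam * (E_pi[r1] - c).
    r_lag = r0 + lam * r1
    grad_theta = pi * (r_lag - pi @ r_lag)
    theta += eta_theta * grad_theta

    # Dual step: projected subgradient descent on the dual function,
    #   lam <- max(0, lam - eta * (E_pi[r1] - c)).
    lam = max(0.0, lam - eta_lam * (pi @ r1 - c))

pi = softmax(theta)
print("policy:", pi, "E[r0]:", pi @ r0, "E[r1]:", pi @ r1, "lambda:", lam)
```

In practice such iterates can oscillate around the saddle point; averaging the iterates or using smaller dual step sizes is a common remedy.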
