IEEE Transactions on Cybernetics

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential


Abstract

Gradient-based methods have been extensively used in today's multiagent reinforcement learning (MARL). In a gradient-based MARL algorithm, each agent updates its parameterized strategy in the direction of the gradient of some performance index. However, few studies have addressed the convergence of existing gradient-based MARL algorithms for identical-interest games. In this article, we propose a policy gradient potential (PGP) algorithm that uses the PGP, rather than the gradient itself, as the source of information guiding the strategy update, in order to learn the optimal joint strategy with the maximal global reward. Since the payoff matrix and the joint strategy are often unavailable to the learning agents in practice, we take the probability of obtaining the maximal reward as the performance index. Theoretical analysis of the PGP algorithm on a continuous model of an identical-interest repeated game shows that if the component action of every optimal joint action is unique, the critical points corresponding to all optimal joint actions are asymptotically stable. The PGP algorithm is studied experimentally and compared against other MARL algorithms on two commonly used collaborative tasks (the robots-leaving-a-room task and the distributed-sensor-network task) as well as on a real-world minefield navigation problem where only local state and local reward information are available. The results show that the PGP algorithm outperforms the other algorithms in terms of the cumulative reward and the number of time steps used per episode.
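The abstract does not give implementation details of PGP. As a point of reference, the following is a minimal Python sketch (not the paper's algorithm) of the kind of gradient-based independent learner the abstract contrasts against: two agents with softmax-parameterized strategies play an identical-interest repeated matrix game and perform REINFORCE-style updates using a binary signal for whether the maximal reward was obtained, mirroring the performance index described above. The payoff matrix, learning rate, parameterization, and episode count are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: gradient-based independent learners in an
# identical-interest repeated matrix game. This is NOT the paper's PGP
# algorithm; payoffs, step size, and horizon are assumed for the example.

PAYOFF = np.array([[3.0, 0.0, 1.0],
                   [0.0, 2.0, 0.0],
                   [1.0, 0.0, 3.5]])   # identical payoff for both agents
MAX_REWARD = PAYOFF.max()              # attained only at joint action (2, 2)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
thetas = [np.zeros(3), np.zeros(3)]    # one preference vector per agent
alpha = 0.1                            # assumed learning rate

for step in range(5000):
    policies = [softmax(th) for th in thetas]
    actions = [rng.choice(3, p=p) for p in policies]
    # Performance signal from the abstract: did the joint action
    # attain the maximal reward?
    r = 1.0 if PAYOFF[actions[0], actions[1]] == MAX_REWARD else 0.0
    for th, p, a in zip(thetas, policies, actions):
        grad_log = -p.copy()
        grad_log[a] += 1.0             # gradient of log softmax(theta)[a]
        th += alpha * r * grad_log     # REINFORCE-style ascent

print("learned policies:", [np.round(softmax(th), 3) for th in thetas])
```

In this toy game the optimal joint action (2, 2) has unique component actions, which is the condition under which the abstract states that the critical points corresponding to optimal joint actions are asymptotically stable; with multiple equally good joint actions, independent gradient learners of this kind can miscoordinate.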
