IEEE Transactions on Cybernetics

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential


Abstract

Gradient-based methods have been extensively used in today's multiagent reinforcement learning (MARL). In a gradient-based MARL algorithm, each agent updates its parameterized strategy in the direction of the gradient of some performance index. However, few studies have addressed the convergence of existing gradient-based MARL algorithms for identical-interest games. In this article, we propose a policy gradient potential (PGP) algorithm that uses the PGP, rather than the gradient itself, as the source of information guiding the strategy update, in order to learn the optimal joint strategy with the maximal global reward. Since the payoff matrix and the joint strategy are often unavailable to the learning agents in practice, we take the probability of obtaining the maximal reward as the performance index. Theoretical analysis of the PGP algorithm on a continuous model of an identical-interest repeated game shows that if the component action of every optimal joint action is unique, the critical points corresponding to all optimal joint actions are asymptotically stable. The PGP algorithm is studied experimentally and compared against other MARL algorithms on two commonly used collaborative tasks (the robots-leaving-a-room task and the distributed-sensor-network task) as well as on a real-world minefield navigation problem where only local state and local reward information are available. The results show that the PGP algorithm outperforms the other algorithms in terms of the cumulative reward and the number of time steps used per episode.
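The abstract does not give implementation details of PGP. As a point of reference, the following is a minimal Python sketch (not the paper's algorithm) of the kind of gradient-based independent learner the abstract contrasts against: two agents with softmax-parameterized strategies play an identical-interest repeated matrix game and perform REINFORCE-style updates using a binary signal for whether the maximal reward was obtained, mirroring the performance index described above. The payoff matrix, learning rate, parameterization, and episode count are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: gradient-based independent learners in an
# identical-interest repeated matrix game. This is NOT the paper's PGP
# algorithm; payoffs, step size, and horizon are assumed for the example.

PAYOFF = np.array([[3.0, 0.0, 1.0],
                   [0.0, 2.0, 0.0],
                   [1.0, 0.0, 3.5]])   # identical payoff for both agents
MAX_REWARD = PAYOFF.max()              # attained only at joint action (2, 2)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
thetas = [np.zeros(3), np.zeros(3)]    # one preference vector per agent
alpha = 0.1                            # assumed learning rate

for step in range(5000):
    policies = [softmax(th) for th in thetas]
    actions = [rng.choice(3, p=p) for p in policies]
    # Performance signal from the abstract: did the joint action
    # attain the maximal reward?
    r = 1.0 if PAYOFF[actions[0], actions[1]] == MAX_REWARD else 0.0
    for th, p, a in zip(thetas, policies, actions):
        grad_log = -p.copy()
        grad_log[a] += 1.0             # gradient of log softmax(theta)[a]
        th += alpha * r * grad_log     # REINFORCE-style ascent

print("learned policies:", [np.round(softmax(th), 3) for th in thetas])
```

In this toy game the optimal joint action (2, 2) has unique component actions, which is the condition under which the abstract states that the critical points corresponding to optimal joint actions are asymptotically stable; with multiple equally good joint actions, independent gradient learners of this kind can miscoordinate.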
