Connection Science

Learning adversarial policy in multiple scenes environment via multi-agent reinforcement learning



Abstract

Learning an adversarial policy, which aims to learn behavioural strategies for agents with different goals, is one of the most significant tasks in multi-agent systems. Multi-agent reinforcement learning (MARL), the state-of-the-art learning-based approach, employs centralised or decentralised control methods to learn behavioural strategies by interacting with the environment, but its training process suffers from instability and slowness. Since parallel simulation or computation is an effective way to improve training performance, we propose a novel MARL method called Multiple scenes multi-agent proximal Policy Optimisation (MPO) in this paper. In MPO, we first simulate multiple parallel scenes in the training environment: multiple policies control different agents within the same scene, and each policy also controls several identical agents across the parallel scenes. Then, we extend proximal policy optimisation (PPO) with an improved actor-critic network to ensure stable training in multi-agent tasks; the actor network uses only local information for decision making, while the critic network uses global information for training. Finally, effective training trajectories are selected with two criteria from the multiple parallel scenes rather than from a single scene, which accelerates the learning process. We evaluate our approach in two simulated 3D environments: Unity's official open-source soccer game and an unmanned surface vehicle (USV) environment built in Unity. Experiments demonstrate that MPO converges more stably and faster than benchmark methods during model training and learns a stronger adversarial policy than the benchmark models.
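
The actor-critic split described in the abstract (local observations for the actor, global information for the critic) follows the centralised-training, decentralised-execution pattern common in MARL. The sketch below is a minimal, assumption-laden illustration of that split together with the standard PPO clipped objective; the layer sizes, discrete action space, and loss form are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (PyTorch) of a decentralised actor / centralised critic pair
# with a clipped PPO surrogate loss. All dimensions and hyperparameters are
# assumptions for illustration only.
import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Maps one agent's local observation to a discrete action distribution."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, local_obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(local_obs))

class GlobalCritic(nn.Module):
    """Scores the concatenated global state, used only during training."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, global_state: torch.Tensor) -> torch.Tensor:
        return self.net(global_state).squeeze(-1)

def ppo_clip_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Standard clipped surrogate objective used by PPO-style trainers.
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

In a parallel-scenes setup of the kind the abstract describes, the batches fed to such a loss would be gathered from rollouts collected across all scenes and filtered by the paper's two trajectory-selection criteria before each policy update.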
