Multi-Agent Reinforcement Learning (MARL) algorithms face two main difficulties: the curse of dimensionality, and environment non-stationarity due to the independent learning processes carried out concurrently by the agents. In this paper we formalize and prove the convergence of a Distributed Round Robin Q-learning (D-RR-QL) algorithm for cooperative systems. The computational complexity of this algorithm increases linearly with the number of agents. Moreover, it eliminates environment non-stationarity by carrying out round-robin scheduling of action selection and execution. This learning scheme allows the implementation of Modular State-Action Vetoes (MSAV) in cooperative multi-agent systems, which speeds up learning convergence in over-constrained systems by vetoing state-action pairs that lead to undesired termination states (UTS) in the relevant state-action subspace. Each agent’s local state-action value function learning is an independent process, including the MSAV policies. Coordination of locally optimal policies to obtain the globally optimal joint policy is achieved by a greedy selection procedure using message passing. We show that D-RR-QL improves over state-of-the-art approaches, such as Distributed Q-Learning, Team Q-Learning and Coordinated Reinforcement Learning, in a paradigmatic Linked Multi-Component Robotic System (L-MCRS) control problem: the hose transportation task. L-MCRS are over-constrained systems with many UTS induced by the interaction of the passive linking element and the active mobile robots.
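The two mechanisms named above, round-robin scheduling of action selection (so each agent learns against a stationary environment) and MSAV vetoing of state-action pairs that reach an UTS, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; all class, function and parameter names (`RoundRobinAgent`, `round_robin_step`, `env_step`) are hypothetical.

```python
import random
from collections import defaultdict

class RoundRobinAgent:
    """One agent's independent local Q-table plus its MSAV veto set."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.vetoed = set()           # state-action pairs known to reach an UTS
        self.actions = actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def allowed(self, state):
        """Actions not vetoed in this state (fall back to all if none remain)."""
        acts = [a for a in self.actions if (state, a) not in self.vetoed]
        return acts or self.actions

    def select(self, state):
        """Epsilon-greedy selection restricted to non-vetoed actions."""
        acts = self.allowed(state)
        if random.random() < self.eps:
            return random.choice(acts)
        return max(acts, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s2, undesired=False):
        """Standard Q-learning update; veto the pair if it hit an UTS."""
        if undesired:
            self.vetoed.add((s, a))
        best = max((self.q[(s2, b)] for b in self.allowed(s2)), default=0.0)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best - self.q[(s, a)])

def round_robin_step(agents, env_step, state):
    """Agents act strictly one at a time: while an agent selects and learns,
    no other agent moves, so the environment it observes is stationary."""
    for agent in agents:
        a = agent.select(state)
        s2, r, undesired = env_step(state, a)
        agent.update(state, a, r, s2, undesired)
        state = s2
    return state
```

Because each agent only ever updates its own Q-table between two of its own turns, the per-agent learning cost is independent of the other agents' tables, which is what makes the overall complexity grow linearly with the number of agents.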