IEEE Transactions on Intelligent Transportation Systems

Deep Reinforcement Learning for Event-Driven Multi-Agent Decision Processes



Abstract

The incorporation of macro-actions (temporally extended actions) into multi-agent decision problems has the potential to address the curse of dimensionality associated with such decision problems. Since macro-actions last for stochastic durations, multiple agents executing decentralized policies in cooperative environments must act asynchronously. We present an algorithm that modifies generalized advantage estimation for temporally extended actions, allowing a state-of-the-art policy optimization algorithm to optimize policies in Dec-POMDPs in which agents act asynchronously. We show that our algorithm is capable of learning optimal policies in two cooperative domains, one involving real-time bus holding control and one involving wildfire fighting with unmanned aircraft. Our algorithm works by framing problems as "event-driven decision processes," which are scenarios in which the sequence and timing of actions and events are random and governed by an underlying stochastic process. In addition to optimizing policies with continuous state and action spaces, our algorithm also facilitates the use of event-driven simulators, which do not require time to be discretized into time-steps. We demonstrate the benefit of using event-driven simulation in the context of multiple agents taking asynchronous actions. We show that fixed time-step simulation risks obfuscating the sequence in which closely separated events occur, adversely affecting the policies learned. In addition, we show that arbitrarily shrinking the time-step scales poorly with the number of agents.
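The modification to generalized advantage estimation for temporally extended actions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each decision step k discounts by gamma**tau_k, where tau_k is the macro-action's stochastic duration, and that rewards have already been accumulated over each macro-action. The function name smdp_gae and the choice to keep lambda per decision step (rather than raising it to the duration as well) are assumptions.

```python
import numpy as np

def smdp_gae(rewards, values, durations, gamma=0.99, lam=0.95):
    """Advantage estimation with duration-dependent discounting (sketch).

    rewards:   rewards[k] accumulated over macro-action k        (length K)
    values:    critic estimates V(s_0), ..., V(s_K)              (length K + 1)
    durations: stochastic duration tau_k of each macro-action    (length K)
    """
    K = len(rewards)
    advantages = np.zeros(K)
    gae = 0.0
    for k in reversed(range(K)):
        # Fixed per-step discount gamma is replaced by gamma ** tau_k,
        # so longer macro-actions discount the bootstrap value more.
        step_discount = gamma ** durations[k]
        delta = rewards[k] + step_discount * values[k + 1] - values[k]
        gae = delta + step_discount * lam * gae
        advantages[k] = gae
    return advantages
```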
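The event-driven simulation the abstract contrasts with fixed time-steps can likewise be sketched with a priority queue keyed on exact event timestamps, so closely separated events are never collapsed into one step. All names here (run_event_driven, bus_arrival) are hypothetical; the paper's simulators for the bus-holding and wildfire domains are domain-specific.

```python
import heapq

def run_event_driven(initial_events, horizon):
    """Minimal event-driven simulation loop (illustrative only).

    Each event is a (timestamp, handler) pair; a handler may schedule
    follow-up events, e.g. a macro-action's stochastic completion.
    Events are processed in exact timestamp order, with no fixed
    time-step to discretize (and possibly reorder) them.
    """
    queue, seq = [], 0  # seq breaks timestamp ties so handlers never compare
    for t, handler in initial_events:
        heapq.heappush(queue, (t, seq, handler))
        seq += 1
    while queue:
        now, _, handler = heapq.heappop(queue)
        if now >= horizon:
            break
        for t_next, h_next in handler(now):
            heapq.heappush(queue, (t_next, seq, h_next))
            seq += 1

# Example: a bus arrival at t=1.0 that schedules its own departure.
def bus_arrival(now):
    print(f"arrival at t={now:.2f}")
    return [(now + 0.5, lambda t: print(f"departure at t={t:.2f}") or [])]

run_event_driven([(1.0, bus_arrival)], horizon=10.0)
```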


