International Conference on Autonomous Agents and Multiagent Systems

Can Sophisticated Dispatching Strategy Acquired by Reinforcement Learning?: A Case Study in Dynamic Courier Dispatching System



Abstract

In this paper, we study a courier dispatching problem (CDP) arising from an online pickup-service platform of Alibaba. The CDP aims to assign a set of couriers to serve pickup requests with stochastic spatial and temporal arrival rates across urban regions. The objective is to maximize the revenue of served requests given a limited number of couriers over a period of time. Many online algorithms from the existing literature, such as dynamic matching and vehicle routing strategies, could be applied to tackle this problem. However, these methods rely on appropriately predefined optimization objectives at each decision point, which are hard to specify in dynamic situations. This paper formulates the CDP as a Markov decision process (MDP) and proposes a data-driven approach to derive the optimal dispatching rule-set under different scenarios. Our method stacks multi-layer images of the spatial-temporal map and applies multi-agent reinforcement learning (MARL) techniques to evolve dispatching models. This method resolves the learning inefficiency caused by traditional centralized MDP modeling. Through comprehensive experiments on both artificial and real-world datasets, we show that: 1) by utilizing historical data and considering long-term revenue gains, MARL achieves better performance than myopic online algorithms; 2) MARL is able to construct the mapping from complex scenarios to sophisticated decisions such as dispatching rules; and 3) MARL scales to large-scale real-world scenarios.
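The abstract mentions stacking multi-layer images of the spatial-temporal map as input to the MARL agents. A minimal sketch of what such a stacked observation could look like is below; the channel layout (pending demand, courier counts, normalized time) and the function name are illustrative assumptions, not the paper's exact input design.

```python
import numpy as np

def build_observation(demand, couriers, t, horizon):
    """Stack per-channel spatial maps into one image-like observation.

    demand:   2-D grid of pending pickup requests per city cell
    couriers: 2-D grid of courier counts per city cell
    t, horizon: current step and episode length, encoded as a
                constant time channel in [0, 1]
    Channel layout here is a hypothetical illustration.
    """
    time_channel = np.full(demand.shape, t / horizon)
    # Result has shape (channels, height, width), as a CNN-style input.
    return np.stack([demand, couriers, time_channel], axis=0)

demand = np.array([[2.0, 0.0], [1.0, 3.0]])
couriers = np.array([[1.0, 1.0], [0.0, 2.0]])
obs = build_observation(demand, couriers, t=5, horizon=10)
print(obs.shape)  # (3, 2, 2)
```

Each agent would receive such a tensor and map it to a dispatching decision; the per-agent decomposition is what avoids the centralized-MDP state explosion the abstract refers to.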


