IEEE Transactions on Parallel and Distributed Systems

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters



Abstract

Deep learning (DL) has been widely adopted in various domains of artificial intelligence (AI), achieving dramatic developments in industry and academia. Besides giant AI companies, numerous small and medium-sized enterprises, institutes, and universities (EIUs) have focused on the research and development (R&D) of DL. Considering the high cost of datacenters and high performance computing (HPC) systems, EIUs prefer adopting off-the-shelf GPU clusters as a DL R&D platform for multiple users and developers to process diverse DL workloads. In such scenarios, the scheduling of multiple DL tasks on a shared GPU cluster is both significant and challenging in terms of efficiently utilizing limited resources. Existing schedulers cannot predict the resource requirements of diverse DL workloads, leading to the under-utilization of computing resources and a decline in user satisfaction. This paper proposes GENIE, a QoS-aware dynamic scheduling framework for a shared GPU cluster, which achieves users' QoS guarantee and high system utilization. In accordance with an exhaustive characterization, GENIE analyzes the key factors that affect the performance of DL tasks and proposes a prediction model derived from lightweight profiling to estimate the processing rate and response latency for diverse DL workloads. Based on the prediction models, we propose a QoS-aware scheduling algorithm to identify the best placements for DL tasks and schedule them on the shared cluster. Experiments on a GPU cluster and large-scale simulations demonstrate that GENIE achieves a QoS-guarantee percentage improvement of up to 67.4 percent and a makespan reduction of up to 28.2 percent, compared to other baseline schedulers.
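The abstract describes GENIE's two-step approach: estimate each task's processing rate and response latency from lightweight profiling, then place the task on the shared cluster so that its QoS target is met. The sketch below illustrates that idea only; the class and function names, the near-linear GPU-scaling assumption, and the "fewest GPUs that satisfy QoS" placement rule are illustrative assumptions, not the paper's actual prediction model or scheduling algorithm.

```python
# Illustrative sketch (not the GENIE implementation): estimate a DL task's
# per-batch latency on each candidate GPU placement from a short profiling
# run, then pick a placement that meets the task's QoS target while using
# as few GPUs as possible. All names here are hypothetical.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Placement:
    node: str             # host in the shared GPU cluster
    gpus: int             # number of GPUs assigned on that node
    profiled_rate: float  # samples/s per GPU measured during lightweight profiling


@dataclass
class DLTask:
    name: str
    batch_size: int
    qos_latency_s: float  # user-specified response-latency target (seconds)


def predict_latency(task: DLTask, placement: Placement) -> float:
    """Estimate per-batch latency from the profiled processing rate.

    A near-linear scaling model is assumed purely for illustration; the
    paper derives its own prediction model from profiling data.
    """
    rate = placement.profiled_rate * placement.gpus
    return task.batch_size / rate


def choose_placement(task: DLTask, candidates: List[Placement]) -> Optional[Placement]:
    """Return a QoS-satisfying placement that uses the fewest GPUs, if any."""
    feasible = [p for p in candidates
                if predict_latency(task, p) <= task.qos_latency_s]
    if not feasible:
        return None  # queue or defer the task; no candidate meets its QoS
    return min(feasible, key=lambda p: p.gpus)


if __name__ == "__main__":
    task = DLTask(name="resnet50-train", batch_size=256, qos_latency_s=0.5)
    candidates = [
        Placement(node="node-a", gpus=1, profiled_rate=400.0),
        Placement(node="node-b", gpus=2, profiled_rate=380.0),
    ]
    print(choose_placement(task, candidates))  # picks node-b: 0.34 s <= 0.5 s
```

Choosing the feasible placement with the fewest GPUs is one simple way to trade off the user's QoS guarantee against cluster utilization, which is the tension the paper's scheduler addresses.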
