IEEE Transactions on Parallel and Distributed Systems

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters



Abstract

Deep learning (DL) has been widely adopted in various domains of artificial intelligence (AI), achieving dramatic developments in industry and academia. Besides giant AI companies, numerous small and medium-sized enterprises, institutes, and universities (EIUs) have focused on the research and development (R&D) of DL. Considering the high cost of datacenters and high performance computing (HPC) systems, EIUs prefer adopting off-the-shelf GPU clusters as a DL R&D platform for multiple users and developers to process diverse DL workloads. In such scenarios, the scheduling of multiple DL tasks on a shared GPU cluster is both significant and challenging in terms of efficiently utilizing limited resources. Existing schedulers cannot predict the resource requirements of diverse DL workloads, leading to the under-utilization of computing resources and a decline in user satisfaction. This paper proposes GENIE, a QoS-aware dynamic scheduling framework for a shared GPU cluster, which achieves users' QoS guarantee and high system utilization. In accordance with an exhaustive characterization, GENIE analyzes the key factors that affect the performance of DL tasks and proposes a prediction model derived from lightweight profiling to estimate the processing rate and response latency for diverse DL workloads. Based on the prediction models, we propose a QoS-aware scheduling algorithm to identify the best placements for DL tasks and schedule them on the shared cluster. Experiments on a GPU cluster and large-scale simulations demonstrate that GENIE achieves a QoS-guarantee percentage improvement of up to 67.4 percent and a makespan reduction of up to 28.2 percent, compared to other baseline schedulers.
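The abstract describes GENIE's two-step approach: estimate each task's processing rate and response latency from lightweight profiling, then place the task on the shared cluster so that its QoS target is met. The sketch below illustrates that idea only; the class and function names, the near-linear GPU-scaling assumption, and the "fewest GPUs that satisfy QoS" placement rule are illustrative assumptions, not the paper's actual prediction model or scheduling algorithm.

```python
# Illustrative sketch (not the GENIE implementation): estimate a DL task's
# per-batch latency on each candidate GPU placement from a short profiling
# run, then pick a placement that meets the task's QoS target while using
# as few GPUs as possible. All names here are hypothetical.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Placement:
    node: str             # host in the shared GPU cluster
    gpus: int             # number of GPUs assigned on that node
    profiled_rate: float  # samples/s per GPU measured during lightweight profiling


@dataclass
class DLTask:
    name: str
    batch_size: int
    qos_latency_s: float  # user-specified response-latency target (seconds)


def predict_latency(task: DLTask, placement: Placement) -> float:
    """Estimate per-batch latency from the profiled processing rate.

    A near-linear scaling model is assumed purely for illustration; the
    paper derives its own prediction model from profiling data.
    """
    rate = placement.profiled_rate * placement.gpus
    return task.batch_size / rate


def choose_placement(task: DLTask, candidates: List[Placement]) -> Optional[Placement]:
    """Return a QoS-satisfying placement that uses the fewest GPUs, if any."""
    feasible = [p for p in candidates
                if predict_latency(task, p) <= task.qos_latency_s]
    if not feasible:
        return None  # queue or defer the task; no candidate meets its QoS
    return min(feasible, key=lambda p: p.gpus)


if __name__ == "__main__":
    task = DLTask(name="resnet50-train", batch_size=256, qos_latency_s=0.5)
    candidates = [
        Placement(node="node-a", gpus=1, profiled_rate=400.0),
        Placement(node="node-b", gpus=2, profiled_rate=380.0),
    ]
    print(choose_placement(task, candidates))  # picks node-b: 0.34 s <= 0.5 s
```

Choosing the feasible placement with the fewest GPUs is one simple way to trade off the user's QoS guarantee against cluster utilization, which is the tension the paper's scheduler addresses.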
