IEEE International Conference on Big Data

A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

Abstract

Today, multi-GPU training has become common practice for deep learning workloads. The performance of a training job can be significantly affected both by the GPU connectivity in the system topology and by the computation-communication pattern of the job. This highlights the need for cluster schedulers to be aware of jobs' performance characteristics in order to improve both job and cluster efficiency. In this paper, we propose an online resource-performance model for deep learning training jobs on GPU clusters. This model can estimate the training speed of a specific job as a function of any given resource setting (i.e., the number and locality of GPUs). The model is based on systematic modeling of the system topology and of the communication patterns of individual jobs, combined with online fitting on a sample set of profiled performance data. Experiments show that our performance model achieves 94% prediction accuracy on average (up to 99.9%). Additionally, a large-scale simulation on a real production trace demonstrates that our model helps a typical scheduling algorithm decrease average job completion time by 3.4x and makespan by 1.7x.
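The abstract describes the model only at a high level, so the following is a minimal illustrative sketch rather than the paper's actual formulation. It assumes per-iteration time decomposes into a compute term scaling with 1/n and a communication term priced by the slowest interconnect in the chosen GPU placement (approximating ring all-reduce volume), with coefficients fitted online by least squares from profiled samples. All names (comm_cost, fit, predict), link-cost constants, and sample figures below are hypothetical.

```python
import numpy as np

# Assumed relative costs for the slowest link in a placement
# (illustrative numbers only, not from the paper).
LINK_COST = {"nvlink": 1.0, "pcie": 3.0, "network": 10.0}

def comm_cost(n_gpus: int, slowest_link: str) -> float:
    """Ring all-reduce moves 2*(n-1)/n of the gradient volume per step;
    scale that volume by the cost of the slowest interconnect used."""
    if n_gpus == 1:
        return 0.0
    return 2.0 * (n_gpus - 1) / n_gpus * LINK_COST[slowest_link]

def fit(samples):
    """Fit T(n, loc) = a/n + b*comm_cost(n, loc) + c by least squares.
    samples: list of ((n_gpus, slowest_link), measured_iter_time)."""
    X = np.array([[1.0 / n, comm_cost(n, link), 1.0]
                  for (n, link), _ in samples])
    y = np.array([t for _, t in samples])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # (a, b, c)

def predict(coeffs, n_gpus, slowest_link):
    """Estimate per-iteration time for an unseen resource setting."""
    a, b, c = coeffs
    return a / n_gpus + b * comm_cost(n_gpus, slowest_link) + c

# Example: fit on a few profiled points, then query a new placement.
profiled = [((1, "nvlink"), 1.00), ((2, "nvlink"), 0.58),
            ((4, "nvlink"), 0.37), ((4, "network"), 0.85)]
coeffs = fit(profiled)
print(predict(coeffs, 8, "network"))  # estimated iteration time
```

Under this kind of additive decomposition, a scheduler can query the fitted model for any candidate (GPU count, locality) placement before committing resources, which is the use case the abstract's scheduling simulation evaluates.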