IEEE International Conference on Big Data

A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

Abstract

Today, multi-GPU training has become common practice for deep learning workloads. The performance of a training job can be significantly affected both by the GPU connectivity in the system topology and by the computation-communication pattern of the job. This highlights the need for cluster schedulers to be aware of jobs' performance characteristics in order to improve both job and cluster efficiency. In this paper, we propose an online resource-performance model for deep learning training jobs on GPU clusters. This model can estimate the training speed of a specific job as a function of any given resource setting (i.e., the number and locality of GPUs). The model is based on systematic modeling of the system topology and of the communication patterns of individual jobs, combined with online fitting on a sample set of profiled performance data. Experiments show that our performance model achieves 94% prediction accuracy on average (up to 99.9%). Additionally, a large-scale simulation on a real production trace demonstrates that our model helps a typical scheduling algorithm decrease average job completion time by 3.4x and makespan by 1.7x.
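The abstract describes the model only at a high level, so the following is a minimal illustrative sketch rather than the paper's actual formulation. It assumes per-iteration time decomposes into a compute term scaling with 1/n and a communication term priced by the slowest interconnect in the chosen GPU placement (approximating ring all-reduce volume), with coefficients fitted online by least squares from profiled samples. All names (comm_cost, fit, predict), link-cost constants, and sample figures below are hypothetical.

```python
import numpy as np

# Assumed relative costs for the slowest link in a placement
# (illustrative numbers only, not from the paper).
LINK_COST = {"nvlink": 1.0, "pcie": 3.0, "network": 10.0}

def comm_cost(n_gpus: int, slowest_link: str) -> float:
    """Ring all-reduce moves 2*(n-1)/n of the gradient volume per step;
    scale that volume by the cost of the slowest interconnect used."""
    if n_gpus == 1:
        return 0.0
    return 2.0 * (n_gpus - 1) / n_gpus * LINK_COST[slowest_link]

def fit(samples):
    """Fit T(n, loc) = a/n + b*comm_cost(n, loc) + c by least squares.
    samples: list of ((n_gpus, slowest_link), measured_iter_time)."""
    X = np.array([[1.0 / n, comm_cost(n, link), 1.0]
                  for (n, link), _ in samples])
    y = np.array([t for _, t in samples])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # (a, b, c)

def predict(coeffs, n_gpus, slowest_link):
    """Estimate per-iteration time for an unseen resource setting."""
    a, b, c = coeffs
    return a / n_gpus + b * comm_cost(n_gpus, slowest_link) + c

# Example: fit on a few profiled points, then query a new placement.
profiled = [((1, "nvlink"), 1.00), ((2, "nvlink"), 0.58),
            ((4, "nvlink"), 0.37), ((4, "network"), 0.85)]
coeffs = fit(profiled)
print(predict(coeffs, 8, "network"))  # estimated iteration time
```

Under this kind of additive decomposition, a scheduler can query the fitted model for any candidate (GPU count, locality) placement before committing resources, which is the use case the abstract's scheduling simulation evaluates.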