IEEE International Conference on Autonomic Computing

Speeding up Deep Learning with Transient Servers



Abstract

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable (e.g., for rapidly evaluating new model designs), they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, with a speedup of 7.7X and more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest that frameworks should dynamically change cluster configurations to best take advantage of current conditions.
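The speed/cost trade-off described in the abstract can be sketched with a back-of-envelope calculation. All prices and training times below are hypothetical placeholders (they are not taken from the paper); the sketch only illustrates how a cheaper per-hour transient (spot) price can offset the extra cost that sublinear scaling imposes on a distributed cluster.

```python
# Hypothetical cost comparison: on-demand vs. transient (spot) GPU clusters.
# All numbers are illustrative assumptions, not measurements from the paper.

def training_cost(hourly_price_per_server: float, num_servers: int, hours: float) -> float:
    """Total monetary cost of a training run on a homogeneous cluster."""
    return hourly_price_per_server * num_servers * hours

# Single on-demand server baseline (hypothetical numbers).
baseline_hours = 77.0    # hours to train to target accuracy on one server
on_demand_price = 0.90   # $/hour per GPU server (assumed)
spot_price = 0.27        # transient servers are often steeply discounted (assumed)

# An 8-server cluster scales sublinearly: e.g. a 7.7x speedup
# rather than the ideal 8x.
speedup = 7.7
cluster_hours = baseline_hours / speedup

on_demand_cluster_cost = training_cost(on_demand_price, 8, cluster_hours)
transient_cluster_cost = training_cost(spot_price, 8, cluster_hours)

savings = 1 - transient_cluster_cost / on_demand_cluster_cost
print(f"speedup: {speedup}x, savings vs. on-demand cluster: {savings:.1%}")
```

With these assumed prices the transient cluster delivers the same 7.7x speedup at a fraction of the on-demand cluster's cost; the actual savings depend on the spot-to-on-demand price ratio, which fluctuates with market conditions (one of the reasons the abstract argues for transient-aware, dynamically reconfigurable frameworks).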


