IEEE International Conference on Autonomic Computing

Speeding up Deep Learning with Transient Servers



Abstract

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable (e.g., for rapidly evaluating new model designs), they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, with a speedup of 7.7X and more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest that frameworks should dynamically change cluster configurations to best take advantage of current conditions.
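The speed/cost trade-off described in the abstract can be sketched with a back-of-envelope calculation. All prices and training times below are hypothetical placeholders (they are not taken from the paper); the sketch only illustrates how a cheaper per-hour transient (spot) price can offset the extra cost that sublinear scaling imposes on a distributed cluster.

```python
# Hypothetical cost comparison: on-demand vs. transient (spot) GPU clusters.
# All numbers are illustrative assumptions, not measurements from the paper.

def training_cost(hourly_price_per_server: float, num_servers: int, hours: float) -> float:
    """Total monetary cost of a training run on a homogeneous cluster."""
    return hourly_price_per_server * num_servers * hours

# Single on-demand server baseline (hypothetical numbers).
baseline_hours = 77.0    # hours to train to target accuracy on one server
on_demand_price = 0.90   # $/hour per GPU server (assumed)
spot_price = 0.27        # transient servers are often steeply discounted (assumed)

# An 8-server cluster scales sublinearly: e.g. a 7.7x speedup
# rather than the ideal 8x.
speedup = 7.7
cluster_hours = baseline_hours / speedup

on_demand_cluster_cost = training_cost(on_demand_price, 8, cluster_hours)
transient_cluster_cost = training_cost(spot_price, 8, cluster_hours)

savings = 1 - transient_cluster_cost / on_demand_cluster_cost
print(f"speedup: {speedup}x, savings vs. on-demand cluster: {savings:.1%}")
```

With these assumed prices the transient cluster delivers the same 7.7x speedup at a fraction of the on-demand cluster's cost; the actual savings depend on the spot-to-on-demand price ratio, which fluctuates with market conditions (one of the reasons the abstract argues for transient-aware, dynamically reconfigurable frameworks).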


