Nexus: Bringing Efficient and Scalable Training to Deep Learning Frameworks

Abstract

Industry demand for scalable GPU-based deep learning systems is mounting. Unfortunately, existing training applications built atop popular deep learning frameworks such as Caffe, Theano, and Torch are incapable of conducting distributed GPU training over large-scale clusters. To remedy this situation, this paper presents Nexus, a platform that allows existing deep learning frameworks to easily scale out to multiple machines without sacrificing model accuracy. Nexus leverages a recently proposed distributed parameter management architecture to orchestrate distributed training by a large number of learners spread across the cluster. Based on a characterization of the run-time behavior of existing single-node applications, Nexus is equipped with a suite of optimization schemes, including hierarchical and hybrid parameter aggregation, an enhanced network and computation layer, and quality-guided communication adjustment, to strengthen the communication channels and improve resource utilization. Empirical evaluations with a diverse set of deep learning applications demonstrate that Nexus is easy to integrate and can deliver efficient distributed training services to major deep learning frameworks. In addition, Nexus's optimization schemes are highly effective at shortening training time within targeted accuracy bounds.
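
The abstract names hierarchical parameter aggregation but gives no implementation details. The following is only a minimal sketch of the general idea, assuming a two-level scheme in which the gradients of the learners on one machine are summed locally before a single partial sum per machine is sent across the network. All names (`hierarchical_average`, `flat_average`, the `node-a`/`node-b` keys) are hypothetical and are not part of Nexus's API.

```python
from typing import Dict, List

Gradient = List[float]


def sum_gradients(grads: List[Gradient]) -> Gradient:
    """Element-wise sum of equally sized gradient vectors."""
    return [sum(vals) for vals in zip(*grads)]


def hierarchical_average(cluster: Dict[str, List[Gradient]]) -> Gradient:
    """Two-level aggregation (assumed scheme, not taken from the paper)."""
    # Level 1: each machine reduces the gradients of its local learners.
    per_machine = [sum_gradients(grads) for grads in cluster.values()]
    # Level 2: only one partial sum per machine is combined globally.
    total = sum_gradients(per_machine)
    n_learners = sum(len(grads) for grads in cluster.values())
    return [v / n_learners for v in total]


def flat_average(cluster: Dict[str, List[Gradient]]) -> Gradient:
    """Baseline: every learner's gradient is combined in a single step."""
    all_grads = [g for grads in cluster.values() for g in grads]
    return [v / len(all_grads) for v in sum_gradients(all_grads)]


# Hypothetical cluster: two learners on node-a, one learner on node-b.
cluster = {
    "node-a": [[1.0, 2.0], [3.0, 4.0]],
    "node-b": [[5.0, 6.0]],
}
result = hierarchical_average(cluster)
```

In this sketch the hierarchical result matches the flat average, while the number of vectors crossing machine boundaries drops from one per learner to one per machine.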