IEEE International Conference on Data Mining

GaDei: On Scale-Up Training as a Service for Deep Learning

Abstract

Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. TaaS must serve a wide range of customers who lack the experience and/or resources to tune DL hyper-parameters (e.g., mini-batch size and learning rate), and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed at values that work for all users. Unfortunately, few research papers have studied how to design a system for TaaS workloads. By evaluating the IBM Watson Natural Language Classifier (NLC) workloads, the most popular IBM cognitive service, used by thousands of enterprise-level clients globally, we provide empirical evidence that only a conservative hyper-parameter setup (e.g., a small mini-batch size) can guarantee acceptable model accuracy across a wide range of customers. However, a smaller mini-batch size demands higher communication bandwidth in a parameter-server-based DL training system. In this paper, we characterize the exceedingly high communication bandwidth requirement of TaaS using representative industrial deep learning workloads. We then present GaDei, a highly optimized shared-memory-based scale-up parameter-server design. We evaluate GaDei on both commercial and public benchmarks and demonstrate that it significantly outperforms state-of-the-art parameter-server-based implementations while maintaining the required accuracy. GaDei achieves near-best-possible runtime performance, constrained only by hardware limitations. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault tolerance.
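The bandwidth claim in the abstract can be made concrete with a back-of-the-envelope model: in a parameter-server design, each worker pushes one full gradient to the server per mini-batch, so per-epoch gradient traffic grows as the mini-batch shrinks. The sketch below is an illustration only; the model size, dataset size, and worker count are hypothetical values, not figures from the paper:

```python
def grad_traffic_gb(num_params, dataset_size, batch_size,
                    num_workers, bytes_per_param=4):
    """Estimate total gradient bytes pushed to the parameter server
    per epoch, in GB, assuming one full-gradient push per mini-batch
    per worker (fp32 gradients by default)."""
    updates_per_epoch = dataset_size // batch_size
    total_bytes = num_params * bytes_per_param * updates_per_epoch * num_workers
    return total_bytes / 1e9

# Hypothetical: a 50M-parameter model, 1M training samples, 4 workers.
for bs in (16, 128, 1024):
    print(f"batch={bs:5d}  traffic={grad_traffic_gb(50_000_000, 1_000_000, bs, 4):10.1f} GB/epoch")
```

Under this model, halving the mini-batch size doubles the number of gradient pushes per epoch, which is why the conservative small-batch setting that protects accuracy across customers also stresses interconnect bandwidth.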