IEEE International Conference on Data Mining

GaDei: On Scale-Up Training as a Service for Deep Learning

Abstract

Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. TaaS must serve a wide range of customers who lack the experience and/or resources to tune DL hyper-parameters (e.g., mini-batch size and learning rate), and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed at values that work for all users. Unfortunately, few research papers have studied how to design a system for TaaS workloads. By evaluating the IBM Watson Natural Language Classifier (NLC) workloads, the most popular IBM cognitive service, used by thousands of enterprise-level clients globally, we provide empirical evidence that only a conservative hyper-parameter setup (e.g., a small mini-batch size) can guarantee acceptable model accuracy across a wide range of customers. However, a smaller mini-batch size demands higher communication bandwidth in a parameter-server-based DL training system. In this paper, we characterize the exceedingly high communication bandwidth requirement of TaaS using representative industrial deep learning workloads. We then present GaDei, a highly optimized shared-memory-based scale-up parameter-server design. We evaluate GaDei on both commercial and public benchmarks and demonstrate that it significantly outperforms state-of-the-art parameter-server-based implementations while maintaining the required accuracy. GaDei achieves near-best-possible runtime performance, constrained only by hardware limitations. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault tolerance.
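The bandwidth claim in the abstract can be made concrete with a back-of-the-envelope model: in a parameter-server design, each worker pushes one full gradient to the server per mini-batch, so per-epoch gradient traffic grows as the mini-batch shrinks. The sketch below is an illustration only; the model size, dataset size, and worker count are hypothetical values, not figures from the paper:

```python
def grad_traffic_gb(num_params, dataset_size, batch_size,
                    num_workers, bytes_per_param=4):
    """Estimate total gradient bytes pushed to the parameter server
    per epoch, in GB, assuming one full-gradient push per mini-batch
    per worker (fp32 gradients by default)."""
    updates_per_epoch = dataset_size // batch_size
    total_bytes = num_params * bytes_per_param * updates_per_epoch * num_workers
    return total_bytes / 1e9

# Hypothetical: a 50M-parameter model, 1M training samples, 4 workers.
for bs in (16, 128, 1024):
    print(f"batch={bs:5d}  traffic={grad_traffic_gb(50_000_000, 1_000_000, bs, 4):10.1f} GB/epoch")
```

Under this model, halving the mini-batch size doubles the number of gradient pushes per epoch, which is why the conservative small-batch setting that protects accuracy across customers also stresses interconnect bandwidth.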