
JPAS: Job-progress-aware flow scheduling for deep learning clusters


Abstract

Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training that leverages multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters; as a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing training performance. DL training is a feedback-driven exploration process (e.g., hyper-parameter tuning, model structure optimization) that requires multiple retrainings of models that differ in configuration. Information from the early stage of each retraining can guide the search directly toward high-quality models, so reducing the early-stage time accelerates the exploration of DL training. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to this job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS can reduce the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can markedly reduce the early-stage time and thus accelerate the search for high-quality models.
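The abstract describes the mechanism only at a high level: a periodic greedy pass orders all DDL jobs, each host tags its flows with a priority derived from that order, and the priority-enabled network fabric handles scheduling and rate allocation. The minimal Python sketch below illustrates that idea only; the Job type, the progress-first ordering heuristic, and the fixed 8-class priority mapping are illustrative assumptions, not the paper's actual algorithm.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    progress: float  # assumed: fraction of training completed, in [0, 1]

def order_jobs(jobs):
    # Greedy ordering pass: earlier-stage jobs first. This metric is an
    # assumption; the abstract only says a simple greedy mechanism is used.
    return sorted(jobs, key=lambda j: j.progress)

def assign_priorities(jobs, num_classes=8):
    # Map the job order onto a small, fixed set of network priority classes
    # (e.g., the eight 802.1p classes supported by commodity priority-enabled
    # switches); jobs ranked beyond the last class share the lowest priority.
    return {job.name: min(rank, num_classes - 1)
            for rank, job in enumerate(order_jobs(jobs))}

if __name__ == "__main__":
    jobs = [Job("resnet50", 0.70), Job("bert-tune", 0.05), Job("vgg16", 0.40)]
    print(assign_priorities(jobs))  # smaller class number = higher priority

Each host would then tag the packets of a job's flows with the job's class, so per-flow scheduling reduces to strict priority queueing already implemented in the switch, as the abstract indicates.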

Bibliographic details

  • Source
    Journal of network and computer applications | 2020, Issue 5 | pp. 102590.1-102590.15 | 15 pages
  • Author(s)

  • Author affiliations

    Univ Elect Sci & Technol China Commun & Informat Syst Chengdu Peoples R China;

    Southwest Jiaotong Univ Chengdu Peoples R China;

    Univ Elect Sci & Technol China Chengdu Peoples R China|Peng Cheng Lab Shenzhen Peoples R China;

    Univ Elect Sci & Technol China Comp Sci Chengdu Peoples R China;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: eng
  • CLC classification
  • Keywords

    Machine learning; Deep learning; Flow scheduling; Job-progress-aware;


