
JPAS: Job-progress-aware flow scheduling for deep learning clusters


Abstract

Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training that leverages multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters; as a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing training performance. DL training is a feedback-driven exploration process (e.g., hyper-parameter tuning, model structure optimization) that requires multiple retrainings of models that differ in configuration. Information from the early stage of each retraining can guide the search directly toward high-quality models, so reducing the early-stage time accelerates the exploration of DL training. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to this job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS can reduce the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can markedly reduce the early-stage time and thus accelerate the search for high-quality models.
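The abstract describes the mechanism only at a high level: a periodic greedy pass orders all DDL jobs, each host tags its flows with a priority derived from that order, and the priority-enabled network fabric handles scheduling and rate allocation. The minimal Python sketch below illustrates that idea only; the Job type, the progress-first ordering heuristic, and the fixed 8-class priority mapping are illustrative assumptions, not the paper's actual algorithm.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    progress: float  # assumed: fraction of training completed, in [0, 1]

def order_jobs(jobs):
    # Greedy ordering pass: earlier-stage jobs first. This metric is an
    # assumption; the abstract only says a simple greedy mechanism is used.
    return sorted(jobs, key=lambda j: j.progress)

def assign_priorities(jobs, num_classes=8):
    # Map the job order onto a small, fixed set of network priority classes
    # (e.g., the eight 802.1p classes supported by commodity priority-enabled
    # switches); jobs ranked beyond the last class share the lowest priority.
    return {job.name: min(rank, num_classes - 1)
            for rank, job in enumerate(order_jobs(jobs))}

if __name__ == "__main__":
    jobs = [Job("resnet50", 0.70), Job("bert-tune", 0.05), Job("vgg16", 0.40)]
    print(assign_priorities(jobs))  # smaller class number = higher priority

Each host would then tag the packets of a job's flows with the job's class, so per-flow scheduling reduces to strict priority queueing already implemented in the switch, as the abstract indicates.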

Bibliographic details

  • Source
    Journal of network and computer applications | 2020, Issue 5 | pp. 102590.1-102590.15 | 15 pages
  • Author(s)

  • Author affiliations

    Univ Elect Sci & Technol China Commun & Informat Syst Chengdu Peoples R China;

    Southwest Jiaotong Univ Chengdu Peoples R China;

    Univ Elect Sci & Technol China Chengdu Peoples R China|Peng Cheng Lab Shenzhen Peoples R China;

    Univ Elect Sci & Technol China Comp Sci Chengdu Peoples R China;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: eng
  • CLC classification
  • Keywords

    Machine learning; Deep learning; Flow scheduling; Job-progress-aware;


