
Online job scheduling for distributed machine learning in optical circuit switch networks



Abstract

Networking has become a well-known performance bottleneck for distributed machine learning (DML). Although many studies have focused on accelerating DML communication, they ignore the impact of the physical network on DML performance. Meanwhile, optical circuit switches (OCSes) are increasingly deployed in data centers and clusters and can fundamentally improve DML performance. Because the OCS reconfiguration delay is non-negligible, the OCS scheduling algorithm has a large impact on the performance of upper-layer applications. However, existing OCS scheduling solutions are not suitable for DML jobs, which are iterative and interleave communication and computation stages. Therefore, in this paper, we study online multi-job scheduling for DML in OCS networks. First, we propose heaviest-load-first (HLF), a heuristic algorithm for intra-job scheduling, based on the observation that the completion time of flows on the most heavily loaded port has a significant impact on the job completion time. Furthermore, we present the Shortest Weighted Remaining Time First (SWRTF) algorithm for inter-job scheduling. In SWRTF, an available DML job is scheduled when the served job moves from the communication stage to the computation stage, which significantly improves circuit utilization. Based on large-scale simulations, we demonstrate that HLF can reduce the iteration communication time by up to 64.97% compared to the state-of-the-art circuit scheduler Sunflow. In addition, SWRTF can reduce the Weighted Job Completion Time (WJCT) by up to 42.9%, 54.2%, and 27.2% compared to the Shortest-Job-First, Baraat, and Weighted-First inter-job scheduling algorithms, respectively. (C) 2020 Published by Elsevier B.V.
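The abstract names the two heuristics but does not give their exact rules, so the following is only a minimal sketch of the general idea, not the authors' algorithms. It assumes that a flow is a hypothetical `(src_port, dst_port, volume)` tuple, that "load" means the total volume on a port, and that the SWRTF priority is the ratio `remaining_time / weight`; all names (`Job`, `port_loads`, `heaviest_load_first`, `pick_next_job`) are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float          # assumed: higher weight means a more important job
    remaining_time: float  # assumed: estimated remaining time, arbitrary units
    ready: bool            # True if the job is waiting to start communicating


def port_loads(flows):
    """Sum the traffic volume on every input and output port."""
    loads = defaultdict(float)
    for src, dst, volume in flows:
        loads[("in", src)] += volume
        loads[("out", dst)] += volume
    return loads


def heaviest_load_first(flows):
    """Order flows so those touching the most heavily loaded port come first."""
    loads = port_loads(flows)
    return sorted(
        flows,
        key=lambda f: max(loads[("in", f[0])], loads[("out", f[1])]),
        reverse=True,
    )


def pick_next_job(jobs):
    """Pick the ready job with the smallest remaining_time / weight (assumed rule)."""
    ready = [job for job in jobs if job.ready]
    if not ready:
        return None
    return min(ready, key=lambda job: job.remaining_time / job.weight)


flows = [(0, 1, 10.0), (0, 2, 30.0), (3, 1, 5.0)]
ordered = heaviest_load_first(flows)

jobs = [
    Job("A", weight=1.0, remaining_time=10.0, ready=True),
    Job("B", weight=4.0, remaining_time=20.0, ready=True),
    Job("C", weight=2.0, remaining_time=4.0, ready=False),
]
chosen = pick_next_job(jobs)
```

In this sketch, `pick_next_job` would be called at the moment the currently served job leaves its communication stage, so the circuits are handed to another job instead of sitting idle during computation. A real scheduler would also have to account for the OCS reconfiguration delay, which is omitted here.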

Bibliographic details

  • Source
    Knowledge-Based Systems | 2020, Issue 9 | 106002.1-106002.13 | 13 pages
  • Author affiliations

    Univ Elect Sci & Technol China Key Lab Opt Fiber Sensing & Commun Minist Educ Chengdu Peoples R China;

    Univ Elect Sci & Technol China Key Lab Opt Fiber Sensing & Commun Minist Educ Chengdu Peoples R China|Peng Cheng Lab Shenzhen Peoples R China;

    Univ Elect Sci & Technol China Key Lab Opt Fiber Sensing & Commun Minist Educ Chengdu Peoples R China;

    Univ Elect Sci & Technol China Key Lab Opt Fiber Sensing & Commun Minist Educ Chengdu Peoples R China;

    Univ Elect Sci & Technol China Key Lab Opt Fiber Sensing & Commun Minist Educ Chengdu Peoples R China;

    Southwest Jiaotong Univ Chengdu Peoples R China;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Distributed machine learning (DML); Optical circuit switch (OCS); Online job scheduling; Weighted Job Completion Time (WJCT);

  • Date indexed: 2022-08-18 21:28:49
