Journal: IEEE Transactions on Signal Processing

Computation Scheduling for Distributed Machine Learning With Straggling Workers

Abstract

We study scheduling of computation tasks across $n$ workers in a large-scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to workers in order to tolerate stragglers. We consider sequential computation of the tasks assigned to a worker, with the result of each computation sent to the master right after its completion. Each computation round, which can model an iteration of the stochastic gradient descent (SGD) algorithm, is completed once the master receives $k$ distinct computations, referred to as the computation target. Our goal is to characterize the average completion time as a function of the computation load, which denotes the portion of the dataset available at each worker, and the computation target. We propose two computation scheduling schemes that specify the tasks assigned to each worker, as well as their computation schedule, i.e., the order of execution. Assuming a general statistical model for computation and communication delays, we derive the average completion time of the proposed schemes. We also establish a lower bound on the minimum average completion time by assuming prior knowledge of the random delays. Experiments carried out on an Amazon EC2 cluster show a significant reduction in the average completion time over existing coded and uncoded computing schemes. It is also shown numerically that the gap between the proposed scheme and the lower bound is relatively small, confirming the efficiency of the proposed scheduling design.
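The round model described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's method: the abstract does not specify the two proposed schemes, so the cyclic assignment below (`cyclic_schedule`) is a hypothetical schedule chosen only for illustration. The sketch also assumes that the dataset is split into $n$ tasks, that each worker holds `r` of them (the computation load), that delays are independent exponentials, and that sending a result does not block the worker's next computation.

```python
import random


def cyclic_schedule(n, r):
    """Hypothetical schedule: worker i runs tasks i, i+1, ..., i+r-1 (mod n) in order."""
    return [[(i + j) % n for j in range(r)] for i in range(n)]


def completion_time(schedule, k, comp_delay, comm_delay):
    """Return the time at which the master holds k distinct task results.

    comp_delay[w][s] and comm_delay[w][s] are the computation and
    communication delays of the s-th task executed by worker w.
    """
    arrivals = []  # (arrival time at the master, task index)
    for worker, tasks in enumerate(schedule):
        clock = 0.0
        for slot, task in enumerate(tasks):
            clock += comp_delay[worker][slot]  # tasks run sequentially
            # Assumption: the send overlaps with the next computation.
            arrivals.append((clock + comm_delay[worker][slot], task))

    seen = set()
    for time, task in sorted(arrivals):
        seen.add(task)  # duplicate results do not count twice
        if len(seen) == k:
            return time
    raise ValueError("schedule cannot deliver k distinct tasks")


def average_completion_time(n, r, k, rounds=2000, seed=0):
    """Estimate the average completion time under assumed exponential delays."""
    rng = random.Random(seed)
    schedule = cyclic_schedule(n, r)
    total = 0.0
    for _ in range(rounds):
        comp = [[rng.expovariate(1.0) for _ in range(r)] for _ in range(n)]
        comm = [[rng.expovariate(2.0) for _ in range(r)] for _ in range(n)]
        total += completion_time(schedule, k, comp, comm)
    return total / rounds


if __name__ == "__main__":
    # Compare no redundancy (r = 1) with increasing computation load.
    for load in (1, 2, 4):
        print(load, average_completion_time(n=8, r=load, k=8))
```

Running the script prints one estimate per computation load, which gives a rough feel for how redundancy trades extra work per worker against waiting for the slowest one. The delay rates and the values of `n`, `r`, and `k` are arbitrary and would need to be replaced by measured statistics for any real comparison.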