Computation Scheduling for Distributed Machine Learning With Straggling Workers


Abstract

We study the scheduling of computation tasks across $n$ workers in a large-scale distributed learning problem coordinated by a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to workers in order to tolerate stragglers. Each worker computes its assigned tasks sequentially and sends the result of each computation to the master as soon as it is completed. Each computation round, which can model an iteration of the stochastic gradient descent (SGD) algorithm, is completed once the master receives $k$ distinct computations, referred to as the computation target. Our goal is to characterize the average completion time as a function of the computation load, which denotes the portion of the dataset available at each worker, and the computation target. We propose two computation scheduling schemes that specify the tasks assigned to each worker, as well as their computation schedule, i.e., the order of execution. Assuming a general statistical model for computation and communication delays, we derive the average completion time of the proposed schemes. We also establish a lower bound on the minimum average completion time by assuming prior knowledge of the random delays. Experiments carried out on an Amazon EC2 cluster show a significant reduction in the average completion time over existing coded and uncoded computing schemes. It is also shown numerically that the gap between the proposed scheme and the lower bound is relatively small, confirming the efficiency of the proposed scheduling design.
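The round-completion rule in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's implementation. The abstract does not spell out the two proposed schemes, so the cyclic task order used here is an assumption. The exponential delay distributions are also assumptions, as is treating each result's communication delay as independent and non-blocking. All function names (`cyclic_schedule`, `completion_time`, `average_completion_time`) are hypothetical.

```python
import random


def cyclic_schedule(n, r):
    """Assumed schedule: worker i computes tasks i, i+1, ..., i+r-1 (mod n), in that order."""
    return [[(i + j) % n for j in range(r)] for i in range(n)]


def completion_time(schedule, k, comp_delay, comm_delay):
    """Return the time at which the master holds k distinct task results."""
    arrivals = []
    for tasks in schedule:
        clock = 0.0
        for task in tasks:
            clock += comp_delay()  # tasks run sequentially on a worker
            # Each result is sent right after it is computed (assumed non-blocking).
            arrivals.append((clock + comm_delay(), task))
    arrivals.sort()
    seen = set()
    for arrival, task in arrivals:
        seen.add(task)
        if len(seen) == k:
            return arrival
    raise ValueError("schedule covers fewer than k distinct tasks")


def average_completion_time(n, r, k, trials=2000, seed=0):
    """Monte Carlo estimate of the average completion time (assumed exponential delays)."""
    rng = random.Random(seed)
    schedule = cyclic_schedule(n, r)
    comp = lambda: rng.expovariate(1.0)  # assumed computation delay, mean 1
    comm = lambda: rng.expovariate(2.0)  # assumed communication delay, mean 0.5
    total = sum(completion_time(schedule, k, comp, comm) for _ in range(trials))
    return total / trials
```

Calling `average_completion_time(n, r, k)` for several values of the computation load `r` and the computation target `k` gives a rough feel for the trade-off that the abstract describes, under these assumed delay models only.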

Bibliographic details

Source
IEEE Transactions on Signal Processing, 2019, no. 24, pp. 6270-6284 (15 pages)
Author affiliations

Princeton Univ Dept Elect Engn Princeton NJ 08544 USA;

Imperial Coll London Dept Elect & Elect Engn London SW7 2AZ England;

Original format: PDF
Language: English
Keywords
Task analysis; Delays; Processor scheduling; Computational modeling; Encoding; Schedules; Decoding; Distributed machine learning; uncoded computing; computation scheduling; straggling workers;


