Journal of Machine Learning Research

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning



Abstract

Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically treated as missing, or as “erasures,” at every iteration, whose loss is compensated by the embedded redundancy. For quadratic loss functions, we show that under a simple encoding scheme, many optimization algorithms (gradient descent, L-BFGS, and proximal gradient) operating under data parallelism converge to an approximate solution even when stragglers are ignored. Furthermore, we show a similar result for a wider class of convex loss functions when operating under model parallelism. The applicable classes of objectives cover several popular learning problems such as linear regression, LASSO, support vector machines, collaborative filtering, and generalized linear models including logistic regression. These convergence results are deterministic, i.e., they establish sample-path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, demonstrate its performance on several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.
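As a rough illustration of the data-parallel, quadratic-loss case described in the abstract, the sketch below (not the paper's implementation) encodes a least-squares problem with a tall column-orthonormal matrix standing in for an equiangular tight frame, simulates stragglers by dropping random worker blocks at every iteration, and runs gradient descent on whatever partial gradients survive. The names `redundancy`, `n_workers`, and the 0.25 drop probability are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem setup: least-squares regression, the quadratic case covered by the paper.
n, d = 200, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Encode the data with a tall matrix S having orthonormal columns (S^T S = I).
# A random orthonormal construction is used here as a simple stand-in for the
# equiangular tight frames recommended in the paper.
redundancy = 2
m = redundancy * n
S = np.linalg.qr(rng.standard_normal((m, n)), mode="reduced")[0]   # m x n
X_enc, y_enc = S @ X, S @ y

# Split the encoded rows across (hypothetical) workers.
n_workers = 8
blocks = np.array_split(np.arange(m), n_workers)

w = np.zeros(d)
step = 0.1
for _ in range(300):
    # Simulate stragglers: each worker is independently "slow" with probability
    # 0.25 and its partial gradient is simply dropped (treated as an erasure).
    alive = rng.random(n_workers) > 0.25
    if not alive.any():
        continue
    grad = np.zeros(d)
    rows_used = 0
    for k, idx in enumerate(blocks):
        if alive[k]:
            grad += X_enc[idx].T @ (X_enc[idx] @ w - y_enc[idx])
            rows_used += len(idx)
    # Rescale by the surviving fraction so the effective step size stays stable:
    # the partial encoded gradient approximates (rows_used / m) times the full one.
    w -= step * grad / (rows_used / m * n)

print("relative error:", np.linalg.norm(w - w_true) / np.linalg.norm(w_true))
```

Because S has orthonormal columns, the gradient computed from all encoded blocks equals the original gradient, and dropping a random subset of blocks only rescales it approximately; this is why the iterates still approach a neighborhood of the least-squares solution even though stragglers are ignored rather than waited for.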
