Conference: International Conference on Parallel Processing Workshops

A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools

Abstract

Fault tolerance is of increasing importance for parallel computing. While it is often addressed at the system level, application-level resilience techniques may be more efficient. In particular, it seems worthwhile to provide fault-tolerant libraries for reusable patterns such as the task pool. We consider a task pool variant that uses cooperative work stealing, called the lifeline scheme. It is implemented in the GLB library of the PGAS programming language X10. Extending our own previous work, we present a fault-tolerance scheme for this setting that is both communication-efficient and robust. Here, robustness denotes the ability to tolerate multiple coincident failures of interrelated workers. Our algorithm keeps two copies of important data and tolerates almost all permanent place failures that leave one of the copies intact. To achieve this, we nest the execution of restore protocols. We implemented our algorithm within the GLB library. Performance measurements show a steal-count-dependent overhead of 5% to 40% during failure-free operation and a negligible overhead for restore.
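The two-copy idea mentioned in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the GLB implementation or the paper's protocol: the names `Place`, `next_alive`, `write_backup`, `fail`, and `restore` are hypothetical, the second copy is assumed to live on the next live place in a ring, and neither work stealing nor the nesting of restore protocols is modeled.

```python
class Place:
    """One simulated worker (hypothetical model, not an X10 place)."""

    def __init__(self, pid, tasks):
        self.pid = pid
        self.tasks = list(tasks)  # primary copy of this place's task pool
        self.backups = {}         # owner pid -> second copy held for that owner
        self.alive = True


def next_alive(places, pid):
    """Assumed placement rule: the next live place in the ring holds the backup."""
    n = len(places)
    for step in range(1, n):
        candidate = places[(pid + step) % n]
        if candidate.alive:
            return candidate
    return None  # no other live place exists


def write_backup(places, pid):
    """Store a fresh second copy of place `pid`'s pool on its live successor."""
    for p in places:
        p.backups.pop(pid, None)  # drop any stale copy first
    holder = next_alive(places, pid)
    if holder is not None:
        holder.backups[pid] = list(places[pid].tasks)


def fail(places, pid):
    """Simulate a permanent failure: the pool and all held backups are lost."""
    p = places[pid]
    p.alive = False
    p.tasks = []
    p.backups = {}


def restore(places, failed_pid):
    """Let the holder of the surviving copy adopt the failed place's tasks."""
    for holder in places:
        if holder.alive and failed_pid in holder.backups:
            holder.tasks.extend(holder.backups.pop(failed_pid))
            write_backup(places, holder.pid)  # holder's pool changed
            break
    else:
        raise RuntimeError("no intact copy left for place %d" % failed_pid)
    # Live places whose second copy was stored on the failed place need a new one.
    for p in places:
        if p.alive and not any(p.pid in q.backups for q in places if q.alive):
            write_backup(places, p.pid)
```

In this sketch, a failure is recoverable as long as either the primary pool or its backup survives. If a place and the holder of its backup fail together before `restore` runs, `restore` raises an error, which corresponds to the case where neither copy is left intact.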
