Journal: IEEE Transactions on Parallel and Distributed Systems

FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing


Abstract

As large-scale computer systems grow, their mean time between failures (MTBF) is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA outperforms the traditional checkpointing approach.
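
The recomputation idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm or the output of GiFT, neither of which is given here. It is a minimal single-program simulation that assumes a row-block parallel matrix multiplication, assumes the inputs A and B remain available to the surviving processes, and uses hypothetical helper names (`matmul_rows`, `partition`, `recover_in_parallel`). Under those assumptions, each of the p - 1 survivors redoes roughly 1/(p - 1) of the lost work, rather than one processor redoing all of it.

```python
def matmul_rows(A, B, rows):
    """Compute the given rows of C = A x B; returns {row_index: row_of_C}."""
    cols = range(len(B[0]))
    inner = range(len(B))
    return {i: [sum(A[i][k] * B[k][j] for k in inner) for j in cols] for i in rows}


def partition(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for r in range(parts):
        size = base + (1 if r < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def recover_in_parallel(A, B, owned_rows, failed_rank):
    """Redistribute the failed rank's rows over the survivors and recompute them.

    Returns (recovered_rows, share), where share maps each survivor rank to the
    rows it recomputed. The loop over survivors is sequential in this
    simulation; on a real cluster each iteration would run on a different
    process at the same time.
    """
    survivors = [r for r in range(len(owned_rows)) if r != failed_rank]
    lost = owned_rows[failed_rank]
    share = dict(zip(survivors, partition(lost, len(survivors))))
    recovered = {}
    for rank, rows in share.items():
        recovered.update(matmul_rows(A, B, rows))
    return recovered, share


if __name__ == "__main__":
    n, p = 12, 4                                  # illustrative sizes only
    A = [[i + j for j in range(n)] for i in range(n)]
    B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
    owned = partition(list(range(n)), p)          # rank r owns rows owned[r]
    recovered, share = recover_in_parallel(A, B, owned, failed_rank=2)
    print(share)                                  # which survivor redid which rows
```

The sketch ignores failure detection, communication cost, and the redistribution of intermediate state, all of which a real implementation would have to handle.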
