Journal of Supercomputing

Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications


Abstract

The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support, and failures are traditionally addressed with stop-and-restart checkpointing solutions. The User Level Failure Mitigation (ULFM) proposal to include resilience capabilities in the MPI standard opens new opportunities in this field: it allows the implementation of resilient MPI applications, i.e., applications that can detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with that of its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and both transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation focuses on the scalability of the two solutions, comparing them on up to 3072 cores.
