IEEE International Symposium on Parallel and Distributed Processing

Building a Fault Tolerant MPI Application: A Ring Communication Example


Abstract

Process failure is projected to become a normal event for many long-running, scalable High Performance Computing (HPC) applications. As a result, many application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to make application recovery more efficient than existing checkpoint/restart techniques alone allow. Unfortunately for these developers, the libraries their applications depend on, such as the Message Passing Interface (MPI), do not have standardized fault tolerance semantics. This paper introduces a set of run-through stabilization semantics being developed by the MPI Forum's Fault Tolerance Working Group to support ABFT. Using a well-known ring communication program as the running example, the paper illustrates, for application developers new to ABFT, some of the issues that arise when designing a fault tolerant application. The ring program lets the paper focus on communication-level issues rather than the data preservation mechanisms covered by existing literature. The paper highlights a common set of issues that application developers must address in their designs, including program control management, duplicate message detection, termination detection, and testing. The discussion introduces developers new to ABFT both to the new interfaces becoming available and to a range of design issues that they will likely need to address regardless of their research domain.
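To make the communication-level issues named in the abstract more concrete, the following is a minimal sketch in plain Python. It is a single-process simulation, not MPI code and not the paper's implementation. All names (`next_alive`, `run_ring`, `failed`, `resend_at`) are hypothetical and chosen only for illustration. The sketch assumes that failed ranks are known up front, that each message carries an iteration marker, and that a receiver drops any message whose marker it has already accepted.

```python
def next_alive(rank, alive):
    """Return the nearest rank to the right of `rank` that is still alive."""
    n = len(alive)
    for step in range(1, n + 1):
        candidate = (rank + step) % n
        if alive[candidate]:
            return candidate
    raise RuntimeError("no live ranks remain")


def run_ring(num_ranks, iterations, failed=(), resend_at=()):
    """Simulate a token ring; return (deliveries, duplicates_dropped).

    failed:    ranks treated as dead from the start (assumption of this sketch).
    resend_at: (iteration, sender) pairs at which the sender transmits twice,
               standing in for a resend after a suspected neighbor failure.
    """
    alive = [r not in failed for r in range(num_ranks)]
    root = next_alive(num_ranks - 1, alive)  # lowest live rank acts as root
    last_seen = [-1] * num_ranks             # highest marker accepted per rank
    deliveries, dropped = [], 0

    for it in range(iterations):
        sender = root
        while True:
            receiver = next_alive(sender, alive)  # skip over failed ranks
            copies = 2 if (it, sender) in resend_at else 1
            for _ in range(copies):
                if it > last_seen[receiver]:
                    # First copy for this iteration: accept it.
                    last_seen[receiver] = it
                    deliveries.append((it, sender, receiver))
                else:
                    # Marker already seen: treat as a duplicate and discard.
                    dropped += 1
            if receiver == root:
                break  # token returned to the root, so this lap is complete
            sender = receiver
    return deliveries, dropped


if __name__ == "__main__":
    delivered, dropped = run_ring(4, 2, failed={2}, resend_at={(1, 1)})
    print(len(delivered), dropped)  # prints "6 1"
```

In this simplified model, termination is trivial because the root counts completed laps. A real MPI application would additionally have to handle failures that occur mid-lap, including failure of the root itself.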
