Source: Experimental Mechanics (foreign-language journal)

Evaluating and extending user-level fault tolerance in MPI applications


Abstract

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work has evaluated ULFM's performance, yet questions about its programmability and applicability, especially to non-trivial bulk synchronous applications, remain unanswered. In this article, we report our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for the more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex bulk synchronous MPI programs, with better applicability than ULFM and better support for application-level recovery mechanisms such as global rollback.
