Published in: International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms

A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI

Abstract

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintaining future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery, without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.
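
The "probabilistic amplification" argument in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes independent node failures with exponentially distributed times to failure, and the 10-year per-node MTBF and the node counts are hypothetical figures chosen only for illustration. Under those assumptions the system-level MTBF shrinks in proportion to the node count, and the chance of at least one failure during a run approaches certainty.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole machine.

    Assumes independent, exponentially distributed node failures,
    so the failure rates of the nodes simply add up.
    """
    return node_mtbf_hours / nodes


def prob_any_failure(node_mtbf_hours: float, nodes: int, run_hours: float) -> float:
    """Probability that at least one node fails during a run.

    P = 1 - exp(-nodes * run_hours / node_mtbf_hours),
    under the same independence and exponential assumptions.
    """
    return 1.0 - math.exp(-nodes * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 10 years (an assumption, not a measured value).
    node_mtbf = 10 * 365 * 24
    run_hours = 24.0
    for nodes in (1, 1_000, 50_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p = prob_any_failure(node_mtbf, nodes, run_hours)
        print(f"nodes={nodes:>6}  system MTBF={mtbf:10.2f} h  "
              f"P(failure in {run_hours:.0f} h)={p:.6f}")
```

Even with a per-node reliability that generous, a machine with tens of thousands of nodes would, under these assumptions, see failures within hours, which is the motivation the abstract gives for software fault tolerance.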