
TeaMPI: Replication-Based Resilience Without the (Performance) Pain


Abstract

In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication is often not employed on larger scales, as naively mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine, a task-based solver for hyperbolic equation systems, that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
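The task-sharing idea in the abstract can be illustrated with a toy simulation. The sketch below is not the teaMPI or ExaHyPE implementation: it uses no MPI, and it assumes deterministic tasks, lock-step rounds, and instantaneous delivery of shared outcomes. The names `compute` and `run_replicas` are hypothetical and exist only for this illustration.

```python
import random


def compute(task_id):
    """Hypothetical stand-in for an expensive, deterministic task."""
    return task_id * task_id


def run_replicas(num_tasks, num_replicas=2, seed=0):
    """Toy model: replicas work through the same tasks in different orders
    and skip any task whose outcome another replica has already shared."""
    rng = random.Random(seed)

    # Each replica gets its own shuffled execution order over the same tasks.
    orders = []
    for _ in range(num_replicas):
        order = list(range(num_tasks))
        rng.shuffle(order)
        orders.append(order)

    # Outcomes known to each replica (computed locally or received).
    known = [dict() for _ in range(num_replicas)]
    # Number of tasks each replica actually had to compute itself.
    computed = [0] * num_replicas

    # Assumption: lock-step rounds, one task slot per replica per round.
    for step in range(num_tasks):
        for r in range(num_replicas):
            task = orders[r][step]
            if task in known[r]:
                continue  # outcome already received from another replica
            result = compute(task)
            computed[r] += 1
            # Assumption: idealised zero-latency sharing with every replica.
            for outcomes in known:
                outcomes[task] = result
    return known, computed


if __name__ == "__main__":
    known, computed = run_replicas(num_tasks=100)
    print("tasks computed per replica:", computed)
    print("total computations:", sum(computed))
```

Under these idealised assumptions every task is computed exactly once across all replicas, yet each replica ends up holding the full set of outcomes. In a real system, message latency means some tasks would still be computed redundantly, so the saving would be smaller than in this model.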
