Conference: IEEE International Parallel and Distributed Processing Symposium Workshops

Application Fault Tolerance for Shrinking Resources via the Sparse Grid Combination Technique

Abstract

Large-scale scientific simulations need to be resilient to the shrinking and growing of compute resources. This need arises in exascale computing under adverse operating conditions (fault tolerance), and also in cloud computing, where the cost of these resources can fluctuate. In this paper, we describe how the Sparse Grid Combination Technique can make such applications resilient to shrinking compute resources. We describe solutions to the non-trivial issues of data redistribution and of on-the-fly malleability of process grid information and ULFM MPI communicators. Results on a 2D advection solver indicate that process recovery time is significantly reduced compared with the alternative strategy in which failed resources are replaced, that overall execution time improves relative to both that strategy and checkpointing, and that the execution error remains small even when multiple failures occur.
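
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of the classical 2D combination formula, in which the combined solution at level n is the sum of the sub-grid solutions with i + j = n + 1 minus the sum of those with i + j = n (i, j >= 1). It is an illustration only: the function names `combination_coefficients` and `combine` are hypothetical, and the sketch does not reproduce how the paper adapts the technique when resources shrink.

```python
def combination_coefficients(n):
    """Classical 2D combination technique coefficients for level n.

    Returns a dict mapping sub-grid level pairs (i, j), with i, j >= 1,
    to their coefficient: +1 on the diagonal i + j == n + 1 and
    -1 on the diagonal i + j == n.
    """
    if n < 1:
        raise ValueError("level n must be at least 1")
    coeffs = {}
    for i in range(1, n + 1):
        coeffs[(i, n + 1 - i)] = 1   # upper diagonal, added
    for i in range(1, n):
        coeffs[(i, n - i)] = -1      # lower diagonal, subtracted
    return coeffs


def combine(solutions, coeffs):
    """Weighted sum of sub-grid values sampled at one common point.

    `solutions` maps each level pair (i, j) to the value of that
    sub-grid solution at the point of interest (an assumption made
    here to keep the sketch free of interpolation code).
    """
    return sum(c * solutions[level] for level, c in coeffs.items())
```

The coefficients always sum to 1, so a quantity that every sub-grid reproduces exactly (for example a constant field) is also reproduced exactly by the combination.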