Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms

Abstract

This paper compares the performance of different approaches to tolerating failures for applications executing on large-scale failure-prone platforms. We study (i) RIGID applications, which use a constant number of processors throughout execution; (ii) MOLDABLE applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GRIDSHAPED applications, which are moldable applications restricted to rectangular processor grids (such as many dense linear algebra kernels). We start with checkpoint/restart, the de facto standard approach. For each application type, we compute the optimal number of failures to tolerate (i.e., the number that maximizes the yield of the application) before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. For GRIDSHAPED applications, we also investigate Application Based Fault Tolerance (ABFT) techniques and perform the same analysis, computing the optimal number of failures to tolerate and the associated yield. We instantiate our performance model with realistic application scenarios and make it publicly available for further use. We show that using spare nodes gives a much better yield than currently used strategies that restart after each failure. Moreover, the yield is similar for RIGID, MOLDABLE and GRIDSHAPED applications, while the optimal number of failures to tolerate is very high, even for a short wait time between allocations. Finally, MOLDABLE applications have the advantage of restarting less frequently than RIGID applications. © 2019 Elsevier B.V. All rights reserved.
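As background for the checkpoint/restart baseline mentioned in the abstract, the sketch below computes the classical first-order (Young/Daly) checkpointing period and the resulting fraction of time spent on useful work. It is not the paper's performance model: it ignores spare nodes, the wait time between allocations, and the number of failures tolerated, which are the paper's actual subject. It assumes independent, exponentially distributed failures and a checkpoint cost much smaller than the platform MTBF. All function and parameter names are hypothetical.

```python
import math


def platform_mtbf(individual_mtbf: float, processors: int) -> float:
    """MTBF of a platform of `processors` nodes.

    Assumption: failures are independent and exponentially distributed,
    so the platform MTBF is the individual MTBF divided by the node count.
    """
    if processors <= 0:
        raise ValueError("processors must be positive")
    return individual_mtbf / processors


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal amount of work between checkpoints: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_yield(checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of useful time at the Young/Daly period.

    Waste is approximated by sqrt(2 * C / mu); downtime and recovery
    costs are neglected. Only meaningful when C is much smaller than mu.
    """
    return 1.0 - math.sqrt(2.0 * checkpoint_cost / mtbf)


if __name__ == "__main__":
    # Illustrative values only (seconds); not taken from the paper.
    mu = platform_mtbf(individual_mtbf=1000.0, processors=10)
    print("platform MTBF:", mu)
    print("checkpoint period:", young_daly_period(2.0, mu))
    print("first-order yield:", first_order_yield(2.0, mu))
```

Because the platform MTBF shrinks as the processor count grows, this simple estimate already shows why the yield of plain checkpoint/restart degrades at scale, which is the setting in which the abstract's comparison of spare-node strategies becomes relevant.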

Bibliographic details

  • Source
    Parallel Computing | 2019, Issue 7 | pp. 1-12 | 12 pages
  • Author affiliations

    Ecole Normale Super Lyon Lyon France;

    Univ Tennessee Knoxville TN 37996 USA;

    Ecole Normale Super Lyon Lyon France|Univ Tennessee Knoxville TN 37996 USA;

    Univ Tennessee Knoxville TN 37996 USA;

    RIKEN Ctr Computat Sci Kobe Hyogo Japan;

    Univ Tennessee Knoxville TN 37996 USA;

    Univ Tennessee Knoxville TN 37996 USA|Univ Manchester Manchester Lancs England;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: English
