Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms

Le Fevre Valentin; Herault Thomas; Robert Yves; Bouteiller Aurelien; Hori Atsushi; Bosilca George; Dongarra Jack


Abstract

This paper compares the performance of different approaches to tolerate failures for applications executing on large-scale failure-prone platforms. We study (i) RIGID applications, which use a constant number of processors throughout execution; (ii) MOLDABLE applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GRIDSHAPED applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). We start with checkpoint/restart, the de facto standard approach. For each application type, we compute the optimal number of failures to tolerate (i.e., the number that maximizes the yield of the application) before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. For GRIDSHAPED applications, we also investigate Application Based Fault Tolerance (ABFT) techniques and perform the same analysis, computing the optimal number of failures to tolerate and the associated yield. We instantiate our performance model with realistic applicative scenarios and make it publicly available for further usage. We show that using spare nodes grants a much better yield than currently used strategies that restart after each failure. Moreover, the yield is similar for RIGID, MOLDABLE and GRIDSHAPED applications, while the optimal number of failures to tolerate is very high, even for a short wait time in between allocations. Finally, MOLDABLE applications have the advantage of restarting less frequently than RIGID applications. (C) 2019 Elsevier B.V. All rights reserved.


Bibliographic details

Source
Parallel Computing, 2019, issue 7, pp. 1-12 (12 pages)
Authors
Le Fevre Valentin; Herault Thomas; Robert Yves; Bouteiller Aurelien; Hori Atsushi; Bosilca George; Dongarra Jack;
Author affiliations

Ecole Normale Super Lyon Lyon France;

Univ Tennessee Knoxville TN 37996 USA;

Ecole Normale Super Lyon Lyon France|Univ Tennessee Knoxville TN 37996 USA;

Univ Tennessee Knoxville TN 37996 USA;

RIKEN Ctr Computat Sci Kobe Hyogo Japan;

Univ Tennessee Knoxville TN 37996 USA;

Univ Tennessee Knoxville TN 37996 USA|Univ Manchester Manchester Lancs England;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English

Similar literature

1. Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms [J]. Le Fevre Valentin, Herault Thomas, Robert Yves. Parallel Computing, 2019, issue JULa.
2. Bi-Objective Optimization of Data-Parallel Applications on Heterogeneous HPC Platforms for Performance and Energy Through Workload Distribution [J]. Khaleghzadeh Hamidreza, Fahad Muhammad, Shahid Arsalan. IEEE Transactions on Parallel and Distributed Systems, 2021, issue 3.
3. Performance Framework for HPC Applications on Homogeneous Computing Platform [J]. Chandrashekhar B. N, Sanjay H. A. International Journal of Image, Graphics and Signal Processing, 2019, issue 8.
4. Do Moldable Applications Perform Better on Failure-Prone HPC Platforms? [C]. Valentin Le Fevre, George Bosilca, Aurelien Bouteiller. International Conference on Parallel and Distributed Computing, 2019.
5. Performance Enhancement of Multifunctional Surface Enhanced Raman Scattering: From Rigid to Flexible Platforms [D]. Kaichen, Xu. 2018.
6. P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net) [O]. J. Pillardy. 2007.
7. Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms [O]. Valentin Le Fèvre, Thomas Herault, Yves Robert. 2019.
