Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Abstract

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient by checkpointing data and, after a failure occurs, restarting execution to resume from the latest checkpoint. However, re-deploying an application incurs overhead from tearing down and re-instating execution, and it may limit checkpoint retrieval to slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against ULFM, the leading MPI fault-tolerance approach, used here to implement global-restart recovery, and against the typical practice of restarting the application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to process and node failures shows that Reinit++ recovers much faster than restarting (up to 6×) or ULFM (up to 3×), and that it scales excellently as the number of MPI processes grows.
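The checkpoint-and-resume pattern the abstract describes can be illustrated with a minimal, single-process sketch. It does not use MPI and is not the Reinit++ or ULFM API; every name in it (`resilient_main`, `run_with_global_restart`, `SimulatedFailure`, the JSON checkpoint file) is a hypothetical stand-in chosen for illustration. A failure is simulated with an exception, and recovery re-enters the program's entry function in the same process and resumes from the latest checkpoint, which only loosely mirrors the idea of recovering without re-deploying the application.

```python
import json
import os
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a process or node failure."""


def load_checkpoint(path):
    # Resume from the latest checkpoint if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so a failure during the
    # write cannot leave a partially written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def resilient_main(path, n_steps, fail_at):
    # Entry point that is re-entered after a failure (the rollback point).
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if state["step"] in fail_at:
            fail_at.discard(state["step"])  # each injected failure fires once
            raise SimulatedFailure(state["step"])
        state["total"] += state["step"]  # the "computation" for this step
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]


def run_with_global_restart(n_steps, fail_at):
    # Driver: on failure, roll back by calling resilient_main again
    # instead of tearing down and relaunching the whole program.
    fail_at = set(fail_at)
    restarts = 0
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        while True:
            try:
                return resilient_main(path, n_steps, fail_at), restarts
            except SimulatedFailure:
                restarts += 1
```

Calling `run_with_global_restart(10, {3, 7})` injects two failures; the run still produces the same sum as a failure-free run, because each restart resumes from the last saved step rather than from the beginning.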

Bibliographic details

Source: International Conference ISC High Performance: International Conference on High Performance Computing, 2020, pp. 536-554 (19 pages)
Authors: Giorgis Georgakoudis; Luanzheng Guo; Ignacio Laguna
Original format: PDF

