FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

Abstract

As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, fault tolerance and reliability are becoming challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves (often manually) restarting the application on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced, so that the number of processors at checkpoint time and at recovery time is the same. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with a large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.
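The abstract does not spell out how the in-memory scheme works, so the following is only a toy Python sketch of one way such a scheme could behave. It assumes a "buddy" arrangement in which each processor keeps one checkpoint copy in its own memory and a second copy in a neighbor's memory. The `Cluster` class, the ring-order buddy rule, and the round-robin redistribution are all hypothetical names and choices made for illustration; none of this is the FTC-Charm++ API or implementation.

```python
import copy


class Cluster:
    """Toy model: each processor holds live objects plus in-memory checkpoint copies."""

    def __init__(self, num_procs):
        self.alive = list(range(num_procs))
        self.objects = {p: {} for p in self.alive}      # proc -> {obj_id: state}
        self.checkpoints = {p: {} for p in self.alive}  # proc -> {obj_id: saved state}

    def buddy(self, proc):
        # Assumed buddy rule: the next live processor in ring order.
        i = self.alive.index(proc)
        return self.alive[(i + 1) % len(self.alive)]

    def checkpoint(self):
        # Save every object's state twice: locally and in the buddy's memory.
        new = {p: {} for p in self.alive}
        for p in self.alive:
            for obj_id, state in self.objects[p].items():
                new[p][obj_id] = copy.deepcopy(state)
                new[self.buddy(p)][obj_id] = copy.deepcopy(state)
        self.checkpoints = new

    def fail(self, proc):
        # A failed processor loses its live objects and its checkpoint memory.
        self.alive.remove(proc)
        del self.objects[proc]
        del self.checkpoints[proc]

    def restart(self):
        # Roll every object back to its checkpointed state using surviving copies.
        recovered = {}
        for p in self.alive:
            for obj_id, state in self.checkpoints[p].items():
                recovered.setdefault(obj_id, copy.deepcopy(state))
        # No spare processor: spread the objects over the remaining ones.
        self.objects = {p: {} for p in self.alive}
        for n, obj_id in enumerate(sorted(recovered)):
            target = self.alive[n % len(self.alive)]
            self.objects[target][obj_id] = recovered[obj_id]
        # Checkpoint again so that two copies of each object exist once more.
        self.checkpoint()


# Usage: 8 objects on 4 processors, one processor fails after a checkpoint.
cluster = Cluster(4)
for obj_id in range(8):
    cluster.objects[obj_id % 4][obj_id] = {"step": 10, "value": obj_id * 1.5}
cluster.checkpoint()
cluster.objects[2][2]["step"] = 11  # progress made after the checkpoint
cluster.fail(2)
cluster.restart()

all_objects = {k: v for objs in cluster.objects.values() for k, v in objs.items()}
```

In this sketch a single failure never loses an object, because the failed processor's objects survive in its buddy's memory, and the run continues on three processors instead of four. Work done after the last checkpoint is discarded by the rollback.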

Bibliographic details

Source
2004, pp. 93-103 (11 pages)
Authors
Gengbin Zheng; Lixia Shi; L.V. Kale
Original format
PDF
Chinese Library Classification
Radio electronics, telecommunications technology
Keywords
system recovery; application program interfaces; message passing; workstation clusters; fault tolerant computing; FTC-Charm++; in-memory checkpoint-based fault tolerant runtime; Charm++; MPI; high performance clusters; application scalability; disk-based method; checkpoint-time; recovery-time; fault-tolerant runtime; in-memory restart; in-disk checkpoint; in-disk restart; AMPI;

Similar literature

1. In-memory application-level checkpoint-based migration for MPI programs [J]. Ivan Cores, Gabriel Rodriguez, Maria J. Martin. Journal of Supercomputing. 2014, Issue 2
2. Analyzing fault aware collective performance in a process fault tolerant MPI [J]. Joshua Hursey, Richard L. Graham. Parallel Computing. 2012, Issue 1a2
3. DEFT: Dynamic Fault-Tolerant Elastic scheduling for tasks with uncertain runtime in cloud [J]. Yan Hui, Zhu Xiaomin, Chen Huangke. Information Sciences: An International Journal. 2019
4. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI [C]. 2004 IEEE International Conference on Cluster Computing. 2004
5. Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors [D]. Luo, Miao. 2013
6. Sliding Mode Fault Tolerant Control for Unmanned Aerial Vehicle with Sensor and Actuator Faults [O]. Juan Tan, Yonghua Fan, Pengpeng Yan. 2019
7. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI [O]. Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé. 2004
