Published in: High Performance Computing and Grid in Asia Pacific Region, 2004. Proceedings. Seventh International Conference on

Design, implementation and performance of fault-tolerant message passing interface (MPI)


Abstract

Fault-tolerant MPI (FTMPI) adds fault tolerance to MPICH, an open-source, GPL-licensed implementation of the MPI standard from the Mathematics and Computer Science Division of Argonne National Laboratory. FTMPI is a transparent fault-tolerant environment based on a synchronous checkpoint-and-restart mechanism. It relies on a non-multithreaded, single-process checkpointing library to checkpoint each application process synchronously. A global, replicated system controller and a node controller on each cluster node monitor and control the checkpointing and recovery activities of all MPI applications within the cluster. This work details the architecture that provides fault tolerance for MPI-based applications running on clusters, and reports the performance of the NAS Parallel Benchmarks and of two parallelized medium-range weather forecasting models, P-T80 and P-T126. The architecture also addresses the following issues: replicating the system controller to avoid a single point of failure; ensuring the consistency of checkpoint files with a distributed two-phase commit protocol; and providing a robust fault-detection hierarchy.
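
The abstract names a distributed two-phase commit protocol as the way checkpoint files are kept consistent, but gives no detail. The sketch below is only an illustration of the general technique, not FTMPI's implementation: the `NodeController` class, the `checkpoint_round` function, the file naming scheme, and the `fail_prepare` switch are all hypothetical, and the "nodes" are plain objects in one process rather than cluster nodes. Each participant first writes its checkpoint to a temporary file and votes; the coordinator promotes the files only if every vote is yes, so a failed round leaves the previous consistent set of checkpoints untouched.

```python
import os
import tempfile


class NodeController:
    """Hypothetical per-node participant; not the FTMPI API."""

    def __init__(self, rank, directory, fail_prepare=False):
        self.rank = rank
        self.directory = directory
        self.fail_prepare = fail_prepare  # simulates a node that cannot checkpoint

    def _path(self, epoch, suffix=""):
        return os.path.join(self.directory, f"ckpt-r{self.rank}-e{epoch}{suffix}")

    def prepare(self, epoch, state):
        """Phase 1: write the checkpoint to a temporary file and vote."""
        if self.fail_prepare:
            return False
        with open(self._path(epoch, ".tmp"), "wb") as f:
            f.write(state)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach stable storage
        return True

    def commit(self, epoch):
        """Phase 2 (commit): atomically promote the temporary file."""
        os.replace(self._path(epoch, ".tmp"), self._path(epoch))

    def abort(self, epoch):
        """Phase 2 (abort): discard the temporary file if it exists."""
        tmp = self._path(epoch, ".tmp")
        if os.path.exists(tmp):
            os.remove(tmp)


def checkpoint_round(nodes, epoch, states):
    """Coordinator: commit only if every node voted yes in phase 1."""
    votes = [node.prepare(epoch, states[node.rank]) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit(epoch)
        return True
    for node in nodes:
        node.abort(epoch)
    return False


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    nodes = [NodeController(rank, workdir) for rank in range(3)]
    states = {rank: f"state-{rank}".encode() for rank in range(3)}
    print(checkpoint_round(nodes, 1, states), sorted(os.listdir(workdir)))
```

A real coordinator would also have to log its decision and handle participants that fail between the two phases; the sketch leaves both out to keep the commit rule itself visible.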
