Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Fault-tolerant solutions for a MPI compute intensive application

Abstract

The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures, so that a machine failure does not wipe out all the computation already done. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capability to their applications. This work presents two different approaches to adding fault tolerance to the MPI version of an air quality simulation. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between the two approaches are portability, transparency level and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
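
To make the variable-level idea concrete, the following is a minimal sketch of checkpointing and rollback recovery for an iterative loop. It is not the authors' implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a machine failure, and the toy update rule are all assumptions made for illustration. The sketch is sequential so that it runs anywhere; in an MPI program each process would presumably write its own checkpoint file at a point where all processes agree on the iteration number.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write only the variables needed for restart, atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Iterate a toy computation, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a machine failure
    at the given step.
    """
    # Rollback recovery: resume from the last checkpoint if one exists.
    state = load_checkpoint(path) or {"step": 0, "field": [1.0, 2.0, 3.0]}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated machine failure")
        # Stand-in for one time step of the simulation.
        state["field"] = [0.5 * x + 1.0 for x in state["field"]]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from step 0, so only the work done since that checkpoint is repeated. The checkpoint interval sets the trade-off the abstract refers to: shorter intervals lose less work per failure but add more checkpointing overhead.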