Published in: 2011 IEEE 30th International Performance, Computing, and Communications Conference

Evaluation of process level redundant checkpointing/restart for HPC systems


Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive, massively parallel custom architectures to clusters of commodity personal computers to take advantage of cost and performance benefits. To avoid having to restart an application from the beginning after a sudden failure, checkpointing/restart fault tolerance mechanisms are commonly implemented. One drawback of checkpointing/restart is that it adds overhead, which increases the execution time of an application. We present a theoretical analysis of our technique. The results show that process level redundant (PLR) checkpointing/restart can significantly improve the overall reliability of an HPC system.
