Journal: New Generation Computing

Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Abstract

The execution times of large-scale parallel applications on today's multi-/many-core systems are usually longer than the mean time between failures. Parallel applications must therefore tolerate hardware failures so that a machine failure does not wipe out all the computation completed so far. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources, so checkpoint-recovery techniques must minimize these costs to be useful on large-scale systems. This paper proposes and implements three different, complementary techniques for reducing the size of the checkpoints generated by application-level checkpointing. Detailed experimental results obtained on a multicore cluster show that the proposed methods are effective at reducing checkpointing cost.