International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011)

On the Calculation of the Checkpoint Interval in Run-Time for Parallel Applications

Abstract

The growth in the number of components that compose parallel computers increases their fault frequency. In such systems, faults are no longer rare events but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is how to define the checkpoint interval. Checkpoint interval models define variables that depend on application characteristics, e.g., the time needed to take a checkpoint. Using average values and/or statistical data to define these variables reduces the model's accuracy. In this paper we propose a methodology to define, at run time, the values of the variables needed to calculate the checkpoint interval. With uncoordinated checkpointing, this interval can be defined individually for each process of the parallel application. The variable definitions rely on measuring, at run time, the time spent on fault tolerance tasks. Experimental evaluation shows that our methodology reduces the overhead introduced by fault tolerance by more than 3% while the tested applications run in a faulty environment.
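The abstract does not say which checkpoint interval model the authors use. As an illustration only, the sketch below assumes Young's first-order approximation, interval = sqrt(2 · C · MTBF), where C is the checkpoint cost, and shows how a per-process estimate could be refreshed from a cost measured at run time. The names `young_interval`, `IntervalEstimator`, `update`, and `timed_checkpoint` are hypothetical and are not taken from the paper.

```python
import math
import time


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation: sqrt(2 * C * MTBF), in seconds.

    Assumption: the paper may use a different model; this one is only
    a well-known stand-in for illustration.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class IntervalEstimator:
    """Hypothetical per-process estimator fed by run-time measurements."""

    def __init__(self, mtbf_s, initial_cost_s):
        self.mtbf_s = mtbf_s
        self.cost_s = initial_cost_s  # a guess until a real measurement arrives

    def update(self, measured_cost_s):
        """Replace the current cost with the latest measured checkpoint cost."""
        if measured_cost_s > 0:
            self.cost_s = measured_cost_s

    def timed_checkpoint(self, take_checkpoint):
        """Run one checkpoint callable, time it, and record its cost."""
        start = time.perf_counter()
        take_checkpoint()
        self.update(time.perf_counter() - start)

    def interval(self):
        """Checkpoint interval for this process under the current cost."""
        return young_interval(self.cost_s, self.mtbf_s)


est = IntervalEstimator(mtbf_s=3600.0, initial_cost_s=2.0)
print(est.interval())  # 120.0
est.update(8.0)        # a later checkpoint took 8 s on this process
print(est.interval())  # 240.0
```

Because each process keeps its own estimator, processes with larger checkpoint costs end up with longer intervals, which matches the per-process definition that uncoordinated checkpointing allows.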
