Conference: International Conference on Parallel and Distributed Processing Techniques and Applications

On the Calculation of the Checkpoint Interval in Run-Time for Parallel Applications



Abstract

As the number of components in parallel computers grows, so does their fault frequency. In such systems faults are no longer rare events but a common problem, so some form of fault tolerance should be provided. Fault tolerance protocols generally rely on checkpoints, and a common question in checkpointing is how to define the checkpoint interval. Checkpoint interval models define variables that depend on application characteristics, e.g. the time needed to take a checkpoint. Using average values and/or statistical data to define these variables reduces the model's accuracy. In this paper we propose a methodology to define, at run time, the values of the variables needed to calculate the checkpoint interval. With uncoordinated checkpointing, this interval can be defined individually for each process of the parallel application. The variables are defined by measuring, at run time, the time spent on fault tolerance tasks. Experimental evaluation shows that our methodology reduces the overhead introduced by fault tolerance by more than 3% when the tested applications run in a faulty environment.
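The abstract does not say which checkpoint interval model the authors use. As an illustration only, the sketch below assumes Young's first-order approximation, interval = sqrt(2 × C × MTBF), where C is the time needed to take a checkpoint and MTBF is the mean time between failures. The names `young_interval` and `IntervalEstimator` are hypothetical and do not come from the paper. The sketch shows how a single process might recompute its own interval each time it measures the cost of a checkpoint at run time.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: sqrt(2 * C * MTBF).

    Assumption: the paper's abstract does not name its model; this
    formula is used here only as a well-known stand-in.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class IntervalEstimator:
    """Hypothetical per-process estimator fed with run-time measurements."""

    def __init__(self, mtbf_s: float, initial_cost_s: float) -> None:
        self.mtbf_s = mtbf_s          # assumed known or estimated elsewhere
        self.cost_s = initial_cost_s  # initial guess, replaced by measurements

    def record_checkpoint(self, measured_cost_s: float) -> float:
        """Store the latest measured checkpoint cost; return the new interval."""
        self.cost_s = measured_cost_s
        return self.interval()

    def interval(self) -> float:
        """Current checkpoint interval in seconds for this process."""
        return young_interval(self.cost_s, self.mtbf_s)
```

With uncoordinated checkpointing, each process could hold its own estimator, for example `IntervalEstimator(mtbf_s=10000.0, initial_cost_s=2.0)`, and call `record_checkpoint` with the time it actually spent writing its latest checkpoint. The MTBF and cost values here are made up for the example.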
