Journal: International Journal of Grid and High Performance Computing

Modeling of Two-Level Checkpointing With Silent and Fail-Stop Errors in Grid Computing Systems



Abstract

As high-performance computing platforms grow in size, system reliability becomes more challenging, and the system mean time between failures (MTBF) may be too short to allow a completely fault-free run. To benefit more from these systems, applications must therefore include fault tolerance mechanisms that satisfy the required reliability. This manuscript focuses on grid computing platforms that are exposed to two types of threats, crash faults and silent data corruption faults, both of which cause application failure. It also addresses the problem of modeling resource availability and aims to minimize the overhead of checkpoint/recovery fault tolerance techniques. Resource faults have commonly been modeled with an exponential distribution, but that is not fully realistic for transient errors, which appear randomly. In the manuscript, the authors use a Weibull distribution to model these random faults and to determine the optimal time to save checkpoints.
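To make the contrast between the two failure models concrete, the following is a minimal sketch of the standard Weibull survival, hazard, and mean formulas. It is not the authors' model: the function names and the `scale` and `shape` parameters are illustrative assumptions, and the abstract does not give the parameter values or the checkpoint-interval formula the paper actually uses.

```python
import math

# Illustrative helpers (hypothetical names, not taken from the paper).
# `scale` is the Weibull scale parameter, `shape` the shape parameter.

def weibull_survival(t, scale, shape):
    """Probability that no failure has occurred before time t."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate at time t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_mtbf(scale, shape):
    """Mean time between failures implied by the distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

if __name__ == "__main__":
    # shape == 1 reduces to the exponential model: constant failure rate.
    print(weibull_hazard(5.0, 100.0, 1.0), weibull_hazard(50.0, 100.0, 1.0))
    # shape < 1 gives a failure rate that decreases over time.
    print(weibull_hazard(5.0, 100.0, 0.7), weibull_hazard(50.0, 100.0, 0.7))
```

With `shape` equal to 1 the hazard is the same at every `t`, which is the exponential assumption the abstract describes as not fully realistic. Other shape values let the failure rate vary over time, which is the extra flexibility a Weibull model provides.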
