
Does partial replication pay off?

Abstract

As part counts in high performance computing systems are projected to increase faster than part reliabilities, there is increasing interest in enabling jobs to continue executing in the presence of failures. Process replication has been shown to be a viable method to accomplish this, but previous studies have focused on full replication levels (dual, triple, etc.). In this work, we present a model for studying job interrupt times on systems of arbitrary replication degree and arbitrary node failure distribution. We show agreement of this model with a previously developed simulator and make three key observations for systems using process replication: 1) job interrupts are not exponentially distributed (even when underlying node failures are), 2) job mean time to interrupt increases exponentially between full replication degrees, and 3) while partial replication may pay off for interrupt-dominated jobs, full replication degrees offer the best overall value.
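The notion of a partial replication degree can be made concrete with a small Monte Carlo sketch. This is not the model or the simulator referred to in the abstract; it is a simplified illustration under assumptions that are ours: node failures are independent and exponentially distributed, failed replicas are never repaired or restarted, a degree of 1.5 means that half of the logical processes have two replicas and the rest have one, and the job is interrupted as soon as any logical process has lost all of its replicas. The function names and parameters are hypothetical.

```python
import random

def job_time_to_interrupt(num_procs, degree, node_mtbf, rng):
    """Draw one sample of the time until the job is first interrupted.

    Assumptions (ours, for illustration): degree >= 1, exponential node
    failures with mean node_mtbf, no repair of failed replicas.
    """
    base = int(degree)                           # replicas every process gets
    extra = round((degree - base) * num_procs)   # processes with one more replica
    t_job = float("inf")
    for p in range(num_procs):
        replicas = base + (1 if p < extra else 0)
        # A logical process is lost when its last surviving replica fails.
        t_proc = max(rng.expovariate(1.0 / node_mtbf) for _ in range(replicas))
        # The job is interrupted when the first logical process is lost.
        t_job = min(t_job, t_proc)
    return t_job

def mean_time_to_interrupt(num_procs, degree, node_mtbf, trials=2000, seed=0):
    """Estimate the job mean time to interrupt by averaging many samples."""
    rng = random.Random(seed)
    total = sum(job_time_to_interrupt(num_procs, degree, node_mtbf, rng)
                for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for degree in (1.0, 1.25, 1.5, 1.75, 2.0):
        mtti = mean_time_to_interrupt(num_procs=100, degree=degree, node_mtbf=1000.0)
        print(f"degree {degree:.2f}: estimated mean time to interrupt {mtti:.1f}")
```

Running the loop at the bottom for a few degrees between 1 and 2 gives a rough feel for how the mean time to interrupt changes between full replication levels under these simplified assumptions; it does not account for the node cost of the extra replicas, which the abstract's conclusion about overall value depends on.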
