
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance


Abstract

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
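The recovery flow the abstract describes (live nodes roll back in place, the failed node is replaced by a spare, and the job resumes from the last checkpoint) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Job` class, the `pause_and_recover` function, and the node names are hypothetical and are not part of LAM/MPI or BLCR, and no real checkpointing or communication takes place.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy model of a checkpointed MPI job (hypothetical, not a LAM/MPI type)."""
    placement: dict[int, str]                        # MPI rank -> node name
    spares: list[str] = field(default_factory=list)  # idle spare nodes
    last_checkpoint: int = 0                         # id of the last coordinated checkpoint


def pause_and_recover(job: Job, failed_node: str) -> dict[int, str]:
    """Describe what each rank does after failed_node is lost.

    Ranks on live nodes roll back in place; ranks that ran on the failed
    node are restored from the last checkpoint on a spare node.
    """
    if failed_node not in job.placement.values():
        raise ValueError(f"{failed_node} hosts no rank of this job")
    if not job.spares:
        # Assumption of this sketch: without a spare, pausing cannot help.
        raise RuntimeError("no spare node available")

    spare = job.spares.pop(0)
    actions: dict[int, str] = {}
    for rank, node in sorted(job.placement.items()):
        if node == failed_node:
            job.placement[rank] = spare  # the rank migrates to the spare
            actions[rank] = f"restore checkpoint {job.last_checkpoint} on {spare}"
        else:
            actions[rank] = f"roll back to checkpoint {job.last_checkpoint} on {node}"
    return actions


if __name__ == "__main__":
    job = Job(placement={0: "n0", 1: "n1", 2: "n2"}, spares=["s0"], last_checkpoint=7)
    for rank, action in pause_and_recover(job, "n1").items():
        print(rank, action)
```

In this model only the rank on the failed node changes placement. Every other rank keeps its node, which mirrors the abstract's point that a complete restart and requeue are avoided.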
