A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Abstract

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since typically all nodes but one are still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS parallel benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
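
The pause-and-replace idea described in the abstract can be pictured with a small toy model in Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the `Job` class, its fields, and the `pause_and_recover` method are hypothetical names, application state is reduced to one integer per rank, and nothing here calls LAM/MPI or BLCR.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy model of an MPI job (hypothetical; not LAM/MPI or BLCR code)."""
    ranks: dict    # rank -> host currently running that task
    spares: list   # idle hosts that can replace a failed host
    state: dict = field(default_factory=dict)       # rank -> progress counter
    checkpoint: dict = field(default_factory=dict)  # rank -> state at last checkpoint

    def step(self):
        # One unit of application progress on every rank.
        for rank in self.ranks:
            self.state[rank] = self.state.get(rank, 0) + 1

    def coordinated_checkpoint(self):
        # All ranks save their state at the same logical point.
        self.checkpoint = dict(self.state)

    def pause_and_recover(self, failed_host):
        # Find the ranks that were running on the failed host.
        failed = [r for r, h in self.ranks.items() if h == failed_host]
        if len(failed) > len(self.spares):
            raise RuntimeError("not enough spares; a full restart would be needed")
        # Only the failed ranks move; live ranks keep their hosts.
        for rank in failed:
            self.ranks[rank] = self.spares.pop(0)
        # Every rank, live or replaced, resumes from the last checkpoint.
        self.state = dict(self.checkpoint)
        return failed


job = Job(ranks={0: "n0", 1: "n1", 2: "n2"}, spares=["s0"])
job.step()
job.step()
job.coordinated_checkpoint()
job.step()                       # work done after the checkpoint is lost on failure
moved = job.pause_and_recover("n1")
print(moved, job.ranks, job.state)
```

In this model the live ranks 0 and 2 stay on their original hosts and only roll back their state, which mirrors the contrast the abstract draws with a complete restart, where every process would be torn down and relaunched.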

Bibliographic record

Source
2007, pp. 1-10 (10 pages)
Authors
Wang, C.; Mueller, F.; Engelmann, C.; Scott, S.L.
Original format
PDF
Keywords
checkpointing; fault tolerant computing; local area networks; message passing; Berkeley Lab C/R; MPI; checkpoint scheduling; job pause service; local area multicomputer; mean-time-to-failure; network connection; transparent fault tolerance

Similar literature

1. A Transparent Transient Faults Tolerance Mechanism for Superscalar Processors [J]. Toshinori Sato. IEICE Transactions on Information and Systems. 2003, No. 12
2. MPI jobs within MPI jobs: A practical way of enabling task-level fault-tolerance in HPC workflows [J]. Justin M. Wozniak, Matthieu Dorier, Robert Ross. Future Generation Computer Systems. 2019, No. Deca
3. Multi-step Functional Process Adjustments to Reduce No-fault-found Product Failures in Service Caused by In-tolerance Faults [J]. P.K.S. Prakash, D. Ceglarek. Procedia CIRP. 2013, No. 2
4. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance [C]. Chao Wang, Frank Mueller, Christian Engelmann. IEEE International Parallel and Distributed Processing Symposium. 2007
5. A unified framework for transparent parallelism and fault-tolerance in distributed systems [D]. Sunghwan Yoo. 2014
6. An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems [O]. Hajara Idris, Absalom E. Ezugwu, Sahalu B. Junaidu
7. A job pause service under LAM/MPI+BLCR for transparent fault tolerance [O]. Chao Wang, Frank Mueller, Christian Engelmann. 2007
8. Transparent Fault-Tolerance in Parallel Orca Programs [R]. M. F. Kaashoek, R. Michiels, H. E. Bal. 1991
