Journal: Journal of Parallel and Distributed Computing

Proactive process-level live migration and back migration in HPC environments



Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit due to back migration.
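The abstract does not say how the "nearly in half" figure is obtained. One plausible reading, offered here only as a sketch, assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), and assumes that faults handled proactively no longer count against the mean time between failures (MTBF) seen by the reactive scheme. Under those assumptions the checkpoint count scales with sqrt(1 - p), where p is the fraction of faults handled proactively. The checkpoint cost and MTBF values below are hypothetical and cancel out of the ratio.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction):
    """Checkpoints needed with proactive FT, relative to reactive FT alone.

    Assumption: removing a fraction p of faults raises the effective MTBF by
    1 / (1 - p), and the number of checkpoints is inversely proportional to
    the checkpoint interval, so the ratio reduces to sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical inputs, chosen only for illustration.
C = 60.0              # seconds to write one checkpoint
MTBF = 24 * 3600.0    # seconds between failures without proactive FT
p = 0.7               # fraction of faults handled proactively (from the abstract)

base = young_interval(C, MTBF)
improved = young_interval(C, MTBF / (1.0 - p))

# Fewer checkpoints are needed when the interval between them grows.
ratio = base / improved
print(f"checkpoints relative to reactive-only FT: {ratio:.3f}")
```

With p = 0.7 the ratio is sqrt(0.3), roughly 0.55, which is consistent with the abstract's claim of nearly halving the number of checkpoints.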
