Journal of Parallel and Distributed Computing

BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds




Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance as well as support for suspend-resume and offline migration, an efficient checkpoint-restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that can take live incremental snapshots of the whole disk attached to each virtual machine (VM) instance. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background, using a low-overhead technique we call selective copy-on-write. It supports both application-level and process-level checkpointing, and it can roll back filesystem changes. Large-scale experiments demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
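The abstract does not spell out how selective copy-on-write works, so the following is only a toy sketch of one plausible reading, not BlobCR's actual implementation. It assumes the disk is split into fixed-size chunks, that a snapshot records only the chunks modified since the previous snapshot (incremental), and that a background worker persists them while the application keeps writing. A write triggers a copy only if it targets a chunk that still awaits persistence (selective). All names here (`ToyDisk`, `begin_snapshot`, `persist`, `CHUNK_SIZE`) are hypothetical.

```python
import threading

CHUNK_SIZE = 4  # hypothetical toy chunk size in bytes


class ToyDisk:
    """Toy model of incremental snapshots with selective copy-on-write."""

    def __init__(self, n_chunks):
        self.chunks = [bytes(CHUNK_SIZE)] * n_chunks
        self.dirty = set()    # chunks modified since the last snapshot
        self.pending = set()  # snapshot chunks not yet persisted
        self.saved = {}       # copy-on-write copies: index -> old content
        self.repo = []        # persisted snapshots: list of {index: bytes}
        self.cow_copies = 0   # how many writes actually needed a copy
        self.lock = threading.Lock()

    def write(self, index, data):
        with self.lock:
            # Selective part: preserve the old content only if this chunk
            # belongs to the in-flight snapshot and is not yet persisted.
            # (bytes are immutable, so keeping a reference acts as the copy.)
            if index in self.pending and index not in self.saved:
                self.saved[index] = self.chunks[index]
                self.cow_copies += 1
            self.chunks[index] = data
            self.dirty.add(index)

    def begin_snapshot(self):
        """Freeze the set of dirty chunks; returns immediately."""
        with self.lock:
            if self.pending:
                raise RuntimeError("toy model allows one in-flight snapshot")
            self.pending = set(self.dirty)
            self.dirty.clear()
            self.saved = {}
            self.repo.append({})

    def persist(self):
        """Drain pending chunks into the repository (meant for a worker thread)."""
        while True:
            with self.lock:
                if not self.pending:
                    return
                index = self.pending.pop()
                # Prefer the preserved copy if the chunk was overwritten.
                content = self.saved.pop(index, self.chunks[index])
                self.repo[-1][index] = content


disk = ToyDisk(4)
disk.write(0, b"aaaa")
disk.write(1, b"bbbb")
disk.begin_snapshot()    # snapshot 0 covers chunks 0 and 1
disk.write(0, b"AAAA")   # chunk 0 is pending: old content is preserved first
disk.write(2, b"cccc")   # chunk 2 is not in the snapshot: no copy needed
worker = threading.Thread(target=disk.persist)
worker.start()
worker.join()

disk.begin_snapshot()    # snapshot 1 holds only what changed since snapshot 0
disk.persist()
```

In this sketch the first snapshot still sees the original content of chunk 0 even though the application overwrote it before persistence finished, and the second snapshot contains only chunks 0 and 2. The worker is started after the writes here purely to keep the example deterministic; a real system would overlap the two.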


