Conference: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Checkpointing as a Service in Heterogeneous Cloud Environments

Abstract

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerance mechanism for each application, and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or suffer exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.
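
The two administrative behaviours named in the abstract, swapping out a lower-priority job when a higher-priority one arrives and suspending a job whose health report shows very low performance, can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's implementation: `Job`, `CheckpointService`, `capacity`, `min_rate`, and every method name are hypothetical, and "checkpointing" here merely changes a state flag rather than saving a process image.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int            # assumed convention: larger value = more important
    state: str = "pending"   # pending | running | suspended


class CheckpointService:
    """Hypothetical stand-in for a cloud-side checkpoint-restart service."""

    def __init__(self, capacity, min_rate):
        self.capacity = capacity   # how many jobs may run at once (assumption)
        self.min_rate = min_rate   # progress rate below which a job is suspended
        self.jobs = {}

    def running(self):
        return [j for j in self.jobs.values() if j.state == "running"]

    def checkpoint_and_suspend(self, job):
        # A real service would save a checkpoint image before releasing
        # resources; this toy version only records the new state.
        job.state = "suspended"

    def submit(self, job):
        self.jobs[job.name] = job
        active = self.running()
        if len(active) < self.capacity:
            job.state = "running"
            return
        # Over-subscribed: swap out the least important running job,
        # but only if the newcomer outranks it.
        victim = min(active, key=lambda j: j.priority)
        if victim.priority < job.priority:
            self.checkpoint_and_suspend(victim)
            job.state = "running"

    def report_health(self, name, progress_rate):
        # Proactively suspend a running job that is making too little progress.
        job = self.jobs[name]
        if job.state == "running" and progress_rate < self.min_rate:
            self.checkpoint_and_suspend(job)

    def restart(self, name):
        # Resume a suspended job only when capacity is available.
        job = self.jobs[name]
        if job.state == "suspended" and len(self.running()) < self.capacity:
            job.state = "running"
            return True
        return False
```

With `capacity=1`, submitting a priority-1 job and then a priority-5 job leaves the first suspended and the second running; the first can be restarted only after the second finishes or is itself suspended, for instance through `report_health` with a rate below `min_rate`.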