Journal: Parallel Algorithms and Applications
Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

Abstract

Cloud computing offers new computing paradigms, capacity and flexible solutions for high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provision a large number of virtual machines (VMs) for computation-intensive applications. Because an HPC system in the cloud comprises a large number of VMs and electronic components, any fault during execution would require re-running the application, which costs time, money and energy. In this paper we present a proactive fault tolerance (FT) approach for HPC systems in the cloud that reduces the wall-clock execution time and dollar cost in the presence of faults. We also develop a generic FT algorithm for HPC systems in the cloud; the algorithm does not rely on a spare node prior to the prediction of a failure. In addition, we develop a cost model for executing computation-intensive applications on HPC systems in the cloud, and we analyse the dollar cost of provisioning spare nodes and of checkpointing FT to assess the value of our approach. Experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be reduced by as much as 30%. Compared with current FT approaches, our FT approach for HPC in the cloud can reduce the frequency of checkpointing of computation-intensive applications by up to 50%.
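
The abstract does not give the authors' cost model, so the following is only a rough sketch of the kind of accounting it describes: dollar cost as nodes times wall-clock hours times an hourly rate, with checkpointing adding wall-clock overhead and spare nodes adding billed nodes. All function names, parameters and the pricing assumption (a flat per-node hourly rate) are hypothetical and are not taken from the paper.

```python
import math


def run_cost(nodes: int, wall_clock_hours: float, rate_per_node_hour: float) -> float:
    """Dollar cost of a run, assuming a flat per-node hourly rate (assumption)."""
    return nodes * wall_clock_hours * rate_per_node_hour


def wall_clock_with_checkpoints(compute_hours: float,
                                interval_hours: float,
                                checkpoint_overhead_hours: float) -> float:
    """Fault-free wall-clock time when a checkpoint is written every
    `interval_hours` of computation (hypothetical model: no checkpoint
    is written at the very end of the run)."""
    checkpoints = max(0, math.ceil(compute_hours / interval_hours) - 1)
    return compute_hours + checkpoints * checkpoint_overhead_hours


def cost_with_spares(nodes: int, spare_nodes: int,
                     wall_clock_hours: float, rate_per_node_hour: float) -> float:
    """Cost when spare nodes are provisioned (and billed) for the whole run."""
    return run_cost(nodes + spare_nodes, wall_clock_hours, rate_per_node_hour)
```

Under these assumptions, doubling `interval_hours` (that is, halving the checkpoint frequency) shrinks the checkpoint overhead term, and avoiding an always-on spare node removes its share of the bill; how much this saves in practice depends on the failure behaviour, which this sketch does not model.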
