A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud

Abstract

Cloud computing offers new computing paradigms, capacity, and flexibility to high performance computing (HPC) applications by provisioning large numbers of Virtual Machines (VMs) for computation-intensive applications under the Hardware as a Service (HaaS) model. However, because HPC systems in the cloud comprise so many VMs and electronic components, any fault during execution would force the application to be re-run, which costs time, money, and energy. In this paper we present a proactive Fault Tolerance (FT) approach to HPC systems in the cloud that reduces wall clock execution time in the presence of faults. We develop a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node being available before a failure is predicted. We analyze the dollar cost of provisioning spare nodes to assess the value of our approach. Our experimental results, obtained from a real cloud execution environment, show that the wall clock execution time of computation-intensive applications in the cloud can be reduced by as much as 30%. Compared with current FT approaches, our fault tolerance approach for HPC in the cloud can reduce the checkpointing frequency of computation-intensive applications to 50%.
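
The abstract's point about the dollar cost of spare nodes can be illustrated with a rough back-of-the-envelope comparison. The sketch below is not the paper's cost model: the function names, the flat hourly billing, and the sample figures are all assumptions made for illustration. It contrasts keeping an idle spare provisioned for the whole run with paying only for a short overlap period when a replacement node is started after a failure is predicted.

```python
def standby_spare_cost(run_hours, hourly_rate, spares=1):
    """Cost of keeping `spares` idle nodes provisioned for the entire run.

    Hypothetical model: flat hourly billing, spares billed for the full run.
    """
    return run_hours * hourly_rate * spares


def on_demand_spare_cost(predicted_failures, overlap_hours, hourly_rate):
    """Extra cost when a replacement node is provisioned only after a
    failure is predicted.

    Hypothetical model: each predicted failure adds `overlap_hours` during
    which both the failing node and its replacement are billed.
    """
    return predicted_failures * overlap_hours * hourly_rate


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the paper.
    standby = standby_spare_cost(run_hours=100, hourly_rate=0.5)
    on_demand = on_demand_spare_cost(
        predicted_failures=2, overlap_hours=0.5, hourly_rate=0.5
    )
    print(f"standby spare:   ${standby:.2f}")
    print(f"on-demand spare: ${on_demand:.2f}")
```

Under these assumed inputs the on-demand option is cheaper, but the comparison depends entirely on the chosen rate, run length, and number of predicted failures.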