A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

Abstract

High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive, and they may take hours, days or even weeks to complete. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems.

Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud. The cloud computing pricing model eliminates huge capital investments.

However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault-tolerant HPC systems is even greater in a cloud environment.

In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as the dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.