Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

Ifeanyi P. Egwutuoha; Shiping Chen; David Levy; Bran Selic; Rafael Calvo

Abstract

Cloud computing offers new computing paradigms, capacity and flexible solutions for high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provision a large number of virtual machines (VMs) for computation-intensive applications. Because an HPC system in the cloud comprises a large number of VMs and electronic components, any fault during execution would force the application to be re-run, which costs time, money and energy. In this paper we present a proactive fault tolerance (FT) approach for HPC systems in the cloud that reduces wall-clock execution time and dollar cost in the presence of faults. We also developed a generic FT algorithm for HPC systems in the cloud; the algorithm does not require a spare node to be provisioned before a failure is predicted. In addition, we developed a cost model for executing computation-intensive applications on HPC systems in the cloud, and we analysed the dollar cost of provisioning spare nodes and of checkpoint-based FT to assess the value of our approach. Experimental results obtained from a real cloud execution environment show that the wall-clock execution time and the cost of running computation-intensive applications in the cloud can be reduced by as much as 30%. Compared with current FT approaches, our approach also reduces the checkpointing frequency of computation-intensive applications by up to 50%.
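
To make the abstract's three-way cost comparison (spare nodes, checkpointing, proactive FT) concrete, here is a minimal sketch under a deliberately simplified billing model. The model itself, the `Job` type, the functions `standby_spare_cost`, `checkpoint_cost` and `proactive_cost`, and every numeric input are hypothetical illustrations, not the cost model defined in the paper. It assumes each VM is billed per hour at a flat price, and that the proactive option predicts every failure in time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int    # number of VMs the application runs on
    hours: float  # failure-free wall-clock time, in hours
    price: float  # hypothetical price per VM-hour, in dollars


def standby_spare_cost(job: Job, spares: int) -> float:
    """Spare-node FT: keep `spares` idle VMs provisioned for the whole run.
    Simplification: takeover by a spare is assumed to add no wall-clock time."""
    return (job.nodes + spares) * job.hours * job.price


def checkpoint_cost(job: Job, failures: int, interval: float,
                    overhead: float, restart: float) -> float:
    """Checkpoint/restart FT: one checkpoint every `interval` hours, each costing
    `overhead` hours; every failure loses, on average, half an interval of work
    plus `restart` hours of recovery. All nodes are billed for the longer run."""
    checkpoints = job.hours / interval
    wall = job.hours + checkpoints * overhead + failures * (interval / 2 + restart)
    return job.nodes * wall * job.price


def proactive_cost(job: Job, predicted: int, migrate: float) -> float:
    """Proactive FT (hypothetical model): no spare is held in advance. For each
    predicted failure one new VM is provisioned, the job pauses `migrate` hours
    while the at-risk VM's work moves over, and the old VM is then released.
    Assumes every failure is predicted in time."""
    wall = job.hours + predicted * migrate
    overlap_vm_hours = predicted * migrate  # old and new VM both billed during migration
    return (job.nodes * wall + overlap_vm_hours) * job.price


if __name__ == "__main__":
    job = Job(nodes=64, hours=10.0, price=0.10)  # invented inputs, not measured data
    print(f"standby spare : ${standby_spare_cost(job, spares=4):.2f}")
    print(f"checkpointing : ${checkpoint_cost(job, failures=2, interval=1.0, overhead=0.05, restart=0.1):.2f}")
    print(f"proactive     : ${proactive_cost(job, predicted=2, migrate=0.1):.2f}")
```

With these made-up inputs the script prints $68.00, $74.88 and $65.30. The ordering depends entirely on the assumed parameters (a weaker failure predictor, for instance, would push missed failures back onto checkpointing), so it illustrates the shape of such a comparison rather than reproducing the paper's 30% figure.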

Bibliographic details

Source: Parallel Algorithms and Applications, 2014, Issue 4, pp. 363-378 (16 pages)
Authors and affiliations
Ifeanyi P. Egwutuoha: School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia
Shiping Chen: CSIRO, Information Engineering Laboratory, CSIRO ICT Centre, Sydney, NSW, Australia
David Levy: School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia
Bran Selic: School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia
Rafael Calvo: School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia
Indexing information
Original format: PDF
Text language: English
Keywords: HPC; Cloud computing; HaaS; proactive fault tolerance; computation-intensive

Similar literature

1. Fault Tolerance Approach To Improve Performance Computation Of Biological Jobs Using Cloud Computing [J]. P Padmakumari. Research Journal of Pharmaceutical, Biological and Chemical Sciences, 2016, Issue 2.
2. Fault Tolerance Approach To Improve Performance Computation Of Biological Jobs Using Cloud Computing [J]. E. KADIVAR, Kh. RAHIMI, M. A. SHAHZAMANIAN. Research Journal of Pharmaceutical, Biological and Chemical Sciences, 2016, Issue 2.
3. Proactive load balancing fault tolerance algorithm in cloud computing [J]. Attallah Salma M. A., Fayek Magda B., Nassar Salwa M. Concurrency and Computation: Practice and Experience, 2021, Issue 10.
4. A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud [C]. Egwutuoha Ifeanyi P., Chen Shiping, Levy David. The Second International Conference on Cloud and Green Computing, 2012.
5. Proactive Approach for the Prevention of DDoS Attacks in Cloud Computing Environments [D]. Alshehry, Badr. 2016.
6. ELIXIR-IT HPC@CINECA: high performance computing resources for the bioinformatics community [O]. Tiziana Castrignanò, Silvia Gioiosa, Tiziano Flati. 2020.
7. A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud [O]. Egwutuoha Ifeanyi Paulinus. 2014.
8. High Performance Computing (HPC) Innovation Service Portal Pilots Cloud Computing (HPC-ISP Pilot Cloud Computing) [R]. Hochstein, L. 2011.
