Conference: Cluster Computing and the Grid, 2009. CCGRID '09

Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

Abstract

In large-scale clusters and computational grids, component failures are the norm rather than the exception. The occurrence of failures, together with their impact on system performance and operating costs, has become an increasingly important concern for system designers and administrators. In this paper, we study how to use system resources efficiently in high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing and propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We use proactive failure management techniques to calculate each node's reliability status, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose best-fit algorithms that find the best-qualified nodes on which to instantiate VMs to run parallel jobs. We conducted experiments on a real cluster using failure traces from production clusters and the NAS Parallel Benchmark programs. The results show that the proposed strategies improve system productivity and dependability. With the best-fit strategies, the job completion rate increases by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
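
The abstract does not give the formula for the capacity-reliability metric or the details of the best-fit algorithms, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a hypothetical weighted-sum score (the weight `alpha`, the normalized `capacity` and `reliability` fields, and the `min_reliability` threshold are all invented for this example) and interprets "best fit" as picking the qualified nodes with the tightest fit, so that stronger nodes remain available for more demanding jobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float     # assumed: available compute capacity, normalized to [0, 1]
    reliability: float  # assumed: predicted probability of surviving the job's duration


def cr_score(node: Node, alpha: float = 0.5) -> float:
    """Hypothetical capacity-reliability score: a weighted sum of both factors."""
    return alpha * node.capacity + (1.0 - alpha) * node.reliability


def best_fit_nodes(nodes, required_capacity, min_reliability, count, alpha=0.5):
    """Return `count` qualified nodes with the lowest scores (tightest fit).

    A node qualifies if it meets both the capacity demand and the
    reliability threshold. Both criteria are assumptions of this sketch.
    """
    qualified = [
        n for n in nodes
        if n.capacity >= required_capacity and n.reliability >= min_reliability
    ]
    if len(qualified) < count:
        raise ValueError("not enough qualified nodes for this request")
    # Sort ascending so the nodes that barely satisfy the request come first.
    qualified.sort(key=lambda n: cr_score(n, alpha))
    return qualified[:count]


if __name__ == "__main__":
    cluster = [
        Node("a", capacity=0.9, reliability=0.99),
        Node("b", capacity=0.5, reliability=0.95),
        Node("c", capacity=0.6, reliability=0.60),  # rejected: too unreliable
        Node("d", capacity=0.6, reliability=0.90),
    ]
    chosen = best_fit_nodes(cluster, required_capacity=0.5,
                            min_reliability=0.9, count=2)
    print([n.name for n in chosen])
```

In this toy cluster, node `c` is filtered out by the reliability threshold, and node `a` is held back because it exceeds the request by the widest margin. A real implementation would derive `reliability` from failure predictions, as the abstract describes, rather than from fixed values.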