Source: Journal of Parallel and Distributed Computing

Failure-aware resource management for high-availability computing clusters with distributed virtual machines

Abstract

In large-scale networked computing systems, component failures are the norm rather than the exception. Failure-aware resource management is crucial for enhancing system availability and achieving high performance. In this paper, we study how to use system resources efficiently for high-availability computing with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for networked computing systems and propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability states, and we consider both the performance and the reliability status of compute nodes when making selection decisions. We define a capacity-reliability metric that combines the effects of both factors in node selection, and we propose Best-fit algorithms with optimistic and pessimistic selection strategies to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from production systems and the NAS Parallel Benchmark programs on a real-world cluster system. The results show that the proposed strategies enhance system productivity with practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
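The abstract names a capacity-reliability metric and Best-fit selection with optimistic and pessimistic strategies, but it does not give their definitions. The sketch below only illustrates the general idea of scoring nodes on both spare capacity and predicted reliability and picking the tightest fit. Everything in it is an assumption made for illustration and not the paper's method: the `Node` fields, the weighted-slack score in `cr_score`, the weight `alpha`, and the reading of "pessimistic" as discounting the predicted time to failure by a fixed factor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_capacity: float    # spare compute capacity, arbitrary units (hypothetical field)
    time_to_failure: float  # predicted hours until the next failure (hypothetical field)


def cr_score(node: Node, demand: float, duration: float,
             alpha: float = 0.5, discount: float = 1.0) -> Optional[float]:
    """Hypothetical capacity-reliability score; lower means a tighter fit.

    Returns None when the node cannot host the job, either because it lacks
    capacity or because its (discounted) predicted time to failure is shorter
    than the job duration.
    """
    usable_ttf = node.time_to_failure * discount
    if node.free_capacity < demand or usable_ttf < duration:
        return None
    capacity_slack = (node.free_capacity - demand) / demand
    reliability_slack = (usable_ttf - duration) / duration
    return alpha * capacity_slack + (1 - alpha) * reliability_slack


def best_fit(nodes: List[Node], demand: float, duration: float,
             pessimistic: bool = False) -> Optional[Node]:
    """Pick the qualified node with the smallest score, or None if none qualifies.

    Assumption: the pessimistic variant trusts only half of the predicted
    time to failure; the optimistic variant takes the prediction at face value.
    """
    discount = 0.5 if pessimistic else 1.0
    best, best_score = None, None
    for node in nodes:
        score = cr_score(node, demand, duration, discount=discount)
        if score is None:
            continue
        if best_score is None or score < best_score:
            best, best_score = node, score
    return best


if __name__ == "__main__":
    cluster = [
        Node("a", free_capacity=8, time_to_failure=100),
        Node("b", free_capacity=4, time_to_failure=12),
        Node("c", free_capacity=5, time_to_failure=30),
    ]
    print(best_fit(cluster, demand=4, duration=10).name)
    print(best_fit(cluster, demand=4, duration=10, pessimistic=True).name)
```

With these made-up numbers, the optimistic call picks the node that barely outlives the job, while the pessimistic call rejects it and falls back to a node with more reliability headroom. That is the kind of trade-off between utilizing less reliable nodes and completing jobs that the abstract reports results on.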