Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing


Abstract

In large-scale clusters and computational grids, component failures are the norm rather than the exception. Failure occurrence and its impact on system performance and operation costs have become an increasingly important concern to system designers and administrators. In this paper, we study how to efficiently utilize system resources for high-availability clusters with the support of virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for cluster computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage proactive failure management techniques to calculate nodes' reliability status. We consider both the performance and the reliability status of compute nodes in making selection decisions. We define a capacity-reliability metric to combine the effects of both factors in node selection, and propose best-fit algorithms to find the best qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments using failure traces from production clusters and the NAS parallel benchmark programs on a real cluster. The results show that the proposed strategies enhance system productivity and dependability. With the best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.
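The abstract names a capacity-reliability metric and best-fit node selection but does not give their definitions. The following is only a minimal sketch of the general idea under stated assumptions: the weighted-sum score, the `alpha` weight, the `required` threshold, and the names `Node`, `capacity_reliability`, and `best_fit_select` are all hypothetical and are not taken from the paper. "Best fit" is interpreted here in the classic sense, as choosing the qualified nodes whose score exceeds the requirement by the smallest margin.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float     # available compute capacity, normalized to [0, 1] (assumed)
    reliability: float  # estimated chance of staying up during the job, [0, 1] (assumed)


def capacity_reliability(node: Node, alpha: float = 0.5) -> float:
    """Hypothetical combined score: weighted sum of capacity and reliability."""
    return alpha * node.capacity + (1.0 - alpha) * node.reliability


def best_fit_select(nodes: List[Node], k: int, required: float,
                    alpha: float = 0.5) -> Optional[List[str]]:
    """Pick k nodes whose score meets `required` with the smallest surplus.

    Returns the selected node names, or None when fewer than k nodes qualify.
    """
    # Keep only nodes whose combined score meets the requirement.
    qualified = [n for n in nodes if capacity_reliability(n, alpha) >= required]
    if len(qualified) < k:
        return None
    # Best fit: the tightest match comes first (smallest score surplus).
    qualified.sort(key=lambda n: capacity_reliability(n, alpha) - required)
    return [n.name for n in qualified[:k]]


if __name__ == "__main__":
    cluster = [
        Node("a", capacity=0.9, reliability=0.9),
        Node("b", capacity=0.5, reliability=0.7),
        Node("c", capacity=0.2, reliability=0.3),
        Node("d", capacity=0.6, reliability=0.8),
    ]
    print(best_fit_select(cluster, k=2, required=0.55))
```

Choosing the tightest qualifying nodes leaves the strongest nodes free for more demanding jobs; a ranking by highest score would be an equally plausible reading of "best qualified nodes".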


Bibliographic details

Source: Cluster Computing and the Grid, 2009. CCGRID '09 | 2009 | pp. 372-379 | 8 pages
Conference location: Shanghai (CN)
Author: Song Fu
Author affiliation: Dept. of Comput. Sci., New Mexico Inst. of Min. Technol., Socorro, NM
Original format: PDF
Language: English
Chinese Library Classification: Theory, methods
Keywords: grid computing; software performance evaluation; software reliability; virtual machines; NAS parallel benchmark programs; failure management techniques; failure-aware construction; high availability computing; reconfigurable distributed virtual machine infrastructure; system designers; distributed virtual machines; failure-aware resource management; system reconfiguration


Similar literature


1. Failure-aware resource management for high-availability computing clusters with distributed virtual machines [J]. Song Fu. Journal of Parallel and Distributed Computing, 2010, No. 4.
2. Distributed shared arrays: A distributed virtual machine with mobility support for reconfiguration [J]. Song Fu, Cheng-Zhong Xu, Brian Wims. Cluster Computing, 2006, No. 3.
3. Distributed File System Virtualization Techniques Supporting On-Demand Virtual Machine Environments for Grid Computing [J]. Ming Zhao, Jian Zhang, Renato J. Figueiredo. Cluster Computing, 2006, No. 1.
4. Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing [C]. Song Fu. Cluster Computing and the Grid, 2009. CCGRID '09, 2009.
5. Failure-Aware Reconfigurable Distributed Virtual Machine for dependable and high productivity computing [D]. Fu, Song. 2008.
6. Distributed Drug Discovery Part 2: Global Rehearsal of Alkylating Agents for the Synthesis of Resin-Bound Unnatural Amino Acids and Virtual D3 Catalog Construction [O]. William L. Scott, Jordi Alsina.
7. On the Design of Virtual Machine Sandboxes for Distributed Computing in Wide-area Overlays of Virtual Workstations [O]. Vladimir Paramygin, Y. Peter Sheng, Renato J. Figueiredo. 2016.
8. PVM (Parallel Virtual Machine): A Framework for Parallel Distributed Computing [R]. Sunderam, V. S. 1989.
