High performance computing spare replacement hardware fault tolerance.

Abstract

The use of spare replacement hardware together with checkpoint-rollback software fault tolerance on multiple-instruction, multiple-data (MIMD) architectures was investigated. New performance results are presented for two approaches: spare node replacement after a simulated failure, and migration onto a spare node prior to a simulated failure. Both approaches were implemented for application parameter characterization runs on 32 nodes and for scaling runs on 8 to 128 nodes of a MIMD cluster. The CUMULVS system was used for fault tolerance and control features. We evaluated both approaches using runtime to quantify performance and demonstrate their viability.

The principal new results of this study are that: (1) spare node replacement provides good performance at a small cost in runtime; (2) migration onto a spare provides even better performance at a small cost in runtime; and (3) a runtime breakeven point dependent on system scale is identified for both approaches relative to traditional approaches.

Results were quantified for empirical studies on 8 to 128 nodes. These studies investigated applications characterized by various computation-communication ratios, work patterns (steady, accumulate, disperse, hill, and hole), and topologies (ring, one-to-all, and near neighbor). The decreasing cost of commodity hardware enables strategies that can efficiently use a spare as a general means of dynamic redundancy. The gain from these approaches is a reduced mean time to repair (MTTR), because immediate access to a spare shortens recovery time. Checkpoint and rollback overhead is still incurred, but for migration onto a spare, checkpoint overhead can be dramatically reduced.

The scale of distributed-memory MIMD architectures continues to grow as a result of user requests for greater performance, increased computational requirements for finer resolution, and the decreasing cost of commodity hardware. However, these larger architectures experience an increasing frequency of component failures and a subsequent loss of availability. Fault tolerance and availability are therefore important issues for high performance computing systems executing long-running applications. Our research indicates that spare replacement enhances the scalability and availability of MIMD architectures and that further research will pay important dividends.
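The link the abstract draws between a shorter MTTR, system scale, and availability can be illustrated with the standard steady-state availability formula, A = MTTF / (MTTF + MTTR). The sketch below is not taken from the dissertation: it assumes independent, exponentially distributed node failures (so system MTTF is the per-node MTTF divided by the node count), and every numeric value in it is a hypothetical placeholder rather than a measured result.

```python
def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """MTTF of a system that fails when any one node fails.

    Assumes independent, exponentially distributed node failures,
    in which case the failure rates simply add up.
    """
    return node_mttf_hours / nodes


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values for illustration only (not from the source).
    NODE_MTTF = 10_000.0   # hours between failures of a single node
    MTTR_REPAIR = 4.0      # hours to repair or swap hardware by hand
    MTTR_SPARE = 0.1       # hours to fail over to an already available spare

    for n in (8, 32, 128):
        mttf = system_mttf(NODE_MTTF, n)
        a_repair = availability(mttf, MTTR_REPAIR)
        a_spare = availability(mttf, MTTR_SPARE)
        print(f"{n:4d} nodes: repair={a_repair:.4f}  spare={a_spare:.4f}")
```

Under these assumptions the availability gap between the two MTTR values widens as the node count grows, which is consistent with the abstract's point that failures become more frequent at scale.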

Bibliographic record

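  • Author

    Dreicer, Jared Samuel

  • Author affiliation

    The University of New Mexico

  • Degree-granting institution The University of New Mexico
  • Subject Computer Science
  • Degree Ph.D.
  • Year 2004
  • Pages 241 p.
  • Total pages 241
  • Original format PDF
  • Language English
  • Chinese Library Classification Automation technology, computer technology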