High performance computing spare replacement hardware fault tolerance.

Abstract

The use of spare replacement hardware and checkpoint/rollback software fault tolerance on a multiple-instruction-multiple-data (MIMD) architecture was investigated. New performance results are presented for two approaches: spare node replacement after a simulated failure, and migration onto a spare node prior to a simulated failure. Both approaches were implemented for application parameter characterization runs on 32 nodes and for scaling runs on 8 to 128 nodes of a MIMD cluster. The CUMULVS system was used for fault tolerance and control features. We evaluated both approaches using runtime to quantify performance and to demonstrate their viability.
The principal new results of this study are that: (1) spare node replacement provides good performance at a small cost in runtime; (2) migration onto a spare provides even better performance at a small cost in runtime; and (3) a runtime breakeven point dependent on system scale is identified for both approaches relative to traditional approaches.
Results were quantified for empirical studies on 8 to 128 nodes. These studies investigated applications characterized by various computation-communication ratios, work patterns (steady, accumulate, disperse, hill, and hole), and topologies (ring, one-to-all, and near neighbor). The decrease in the cost of commodity hardware enables strategies that can efficiently use a spare as a general means of dynamic redundancy. The gain from these approaches is a reduced mean time to repair (MTTR): with immediate access to a spare, recovery time decreases. Checkpoint and rollback overhead is still incurred, but for migration onto a spare, checkpoint overhead can be dramatically reduced.
The scale of distributed memory MIMD architectures continues to grow as a result of user requests for greater performance, users' increased computational requirements for finer resolution, and the decreasing cost of commodity hardware. However, these larger architectures experience an increasing frequency of component failures and a subsequent loss of availability. Fault tolerance and availability are therefore important issues for high performance computing systems executing long-running applications. Our research indicates that utilizing spare replacement enhances the scalability and availability of MIMD architectures and that further research will pay important dividends.
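To illustrate why a reduced MTTR matters more as the node count grows, the following minimal sketch uses the standard steady-state availability relation, availability = MTTF / (MTTF + MTTR). It is not taken from the dissertation: the failure model (independent, exponentially distributed node failures, where any single node failure stops the job), the function names, and all numeric values are assumptions chosen only for illustration.

```python
def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """Mean time to failure of an N-node system.

    Assumes independent, exponentially distributed node failures and
    that the failure of any one node interrupts the application.
    """
    return node_mttf_hours / nodes


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # All values below are hypothetical, not measurements from the study.
    NODE_MTTF_HOURS = 50_000.0   # assumed per-node mean time to failure
    MTTR_NO_SPARE_HOURS = 4.0    # assumed repair time without a spare on hand
    MTTR_SPARE_HOURS = 0.1       # assumed repair time with an immediate spare

    for n in (8, 32, 128):
        mttf = system_mttf(NODE_MTTF_HOURS, n)
        a_no_spare = availability(mttf, MTTR_NO_SPARE_HOURS)
        a_spare = availability(mttf, MTTR_SPARE_HOURS)
        print(f"{n:4d} nodes: no spare {a_no_spare:.6f}, spare {a_spare:.6f}")
```

Under these assumptions the system MTTF shrinks in proportion to the node count, so a fixed repair time takes up a growing share of each failure cycle; shortening MTTR with a ready spare counteracts that loss.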


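Bibliographic record

Author: Dreicer, Jared Samuel
Author affiliation: The University of New Mexico
Degree-granting institution: The University of New Mexico
Discipline: Computer Science
Degree: Ph.D.
Year: 2004
Pages: 241
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology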

