Fault-Aware Runtime Strategies for High-Performance Computing

Yawei Li; Zhiling Lan; Gujrati P.; Xian-He Sun

Abstract

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with failure predictor and fault tolerance techniques, construct a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
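The abstract names a 0-1 knapsack model for reallocating running jobs but gives no formulation. As a rough illustration only, the sketch below solves a generic 0-1 knapsack by dynamic programming. The mapping is an assumption, not taken from the paper: spare nodes act as the capacity, each at-risk job needs some number of nodes (its weight), and each job carries a hypothetical benefit (its value) for being moved. The function name `select_jobs` and the sample job data are invented for the example.

```python
from typing import List, Tuple


def select_jobs(jobs: List[Tuple[str, int, float]],
                spare_nodes: int) -> Tuple[float, List[str]]:
    """Generic 0-1 knapsack via dynamic programming.

    jobs: (name, nodes_needed, value) triples. These fields are
          hypothetical; the paper's actual weights and values are not
          given in the abstract.
    spare_nodes: knapsack capacity (assumed to be the spare node count).
    Returns the best total value and the names of the selected jobs.
    """
    n = len(jobs)
    # best[i][c] = max total value using only the first i jobs
    # with at most c spare nodes.
    best = [[0.0] * (spare_nodes + 1) for _ in range(n + 1)]
    for i, (_, need, value) in enumerate(jobs, start=1):
        for c in range(spare_nodes + 1):
            best[i][c] = best[i - 1][c]          # skip job i
            if need <= c:
                candidate = best[i - 1][c - need] + value  # take job i
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which jobs were taken.
    chosen: List[str] = []
    c = spare_nodes
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, need, _ = jobs[i - 1]
            chosen.append(name)
            c -= need
    chosen.reverse()
    return best[n][spare_nodes], chosen


if __name__ == "__main__":
    # Hypothetical at-risk jobs: (name, nodes needed, benefit of moving).
    at_risk = [("A", 2, 3.0), ("B", 3, 4.0), ("C", 4, 5.0), ("D", 5, 6.0)]
    print(select_jobs(at_risk, spare_nodes=5))
```

With five spare nodes, the sample data selects jobs A and B, whose combined benefit exceeds that of moving job D alone. The table costs time and memory proportional to the number of jobs times the number of spare nodes, which may be acceptable when the spare pool is small.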

Bibliographic details

Source
Parallel and Distributed Systems, IEEE Transactions on | 2009, No. 4 | pp. 460-473 | 14 pages
Authors
Yawei Li; Zhiling Lan; Gujrati P.; Xian-He Sun
Affiliation
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
Original format
PDF
Language
English
Keywords
failure analysis; knapsack problems; parallel processing; scheduling; software fault tolerance; 0-1 knapsack model; failure prediction; fault management; fault tolerance techniques; fault-aware runtime strategies; high-performance computing; job rescheduling; parallel systems; spare node allocation; Fault-tolerance; Performance;

Similar literature

1. ARTful: A model for user-defined schedulers targeting multiple high-performance computing runtime systems [J]. Santana Alexandre, Freitas Vinicius, Castro Marcio. Software, practice & experience. 2021, No. 7
2. A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems [J]. Jian Gao, Hongmei Wei, Kang Yu. International journal of parallel programming. 2018, No. 4
3. Experimenting with runtime and energy tradeoffs in high-performance computing [J]. G. Uma Maheswari, S. Subha. Progress in Industrial Ecology. 2018, No. 1a2
4. Autonomic Runtime Adaptation Framework for Power Management in Large-Scale High-Performance Computing Systems [C]. Sumit Kumar Saurav, S Bindhumadhva Bapu. IEEE India Council International Conference. 2020
5. Reliable high-performance computing strategies for chemical process modeling: Nonlinear parameter estimation [D]. Gau, Chao-Yang. 2001
6. Scalable analysis of Big pathology image data cohorts using efficient methods and high-performance computing strategies [O]. Tahsin Kurc, Xin Qi, Daihou Wang. 2015
7. Fault-Aware Runtime Strategies for High Performance Computing [O]. Yawei Li, Student Member, Zhiling Lan. 2009
8. Asymmetric Core Computing for U.S. Army High-Performance Computing Applications [R]. Shires, D., Park, S. J., Henz, B. 2009
