Journal: Parallel and Distributed Systems, IEEE Transactions on

Fault-Aware Runtime Strategies for High-Performance Computing

Abstract

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, form a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
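
The abstract does not spell out how the 0-1 knapsack model is set up, so the following is only a generic illustration of the idea, not the formulation used in FARS. It assumes that each running job threatened by a predicted failure needs a whole number of nodes (the knapsack weight) and carries some benefit if it is moved (the knapsack value), and that the spare nodes form the knapsack capacity. The function name `select_jobs_to_reallocate`, the job tuples, and the value figures are all hypothetical.

```python
from typing import List, Tuple


def select_jobs_to_reallocate(
    jobs: List[Tuple[str, int, float]], spare_nodes: int
) -> Tuple[float, List[str]]:
    """Choose jobs to move onto spare nodes with a textbook 0-1 knapsack DP.

    jobs: (name, nodes_needed, value) triples. These fields are assumed
          stand-ins for illustration, not definitions from the paper.
    spare_nodes: number of spare nodes available (the knapsack capacity).
    Returns the best total value and the names of the chosen jobs.
    """
    n = len(jobs)
    # best[i][c] = highest value reachable with the first i jobs and c nodes
    best = [[0.0] * (spare_nodes + 1) for _ in range(n + 1)]
    for i, (_, need, value) in enumerate(jobs, start=1):
        for c in range(spare_nodes + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave job i where it is
            if need <= c:
                candidate = best[i - 1][c - need] + value  # option 2: move it
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which jobs were selected.
    chosen: List[str] = []
    c = spare_nodes
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, need, _ = jobs[i - 1]
            chosen.append(name)
            c -= need
    chosen.reverse()
    return best[n][spare_nodes], chosen


if __name__ == "__main__":
    # Hypothetical jobs: (name, nodes needed, value of moving the job)
    example_jobs = [("A", 4, 10.0), ("B", 3, 7.0), ("C", 2, 6.0), ("D", 5, 12.0)]
    total, chosen = select_jobs_to_reallocate(example_jobs, spare_nodes=7)
    print(total, chosen)  # 18.0 ['C', 'D']
```

With seven spare nodes, moving C and D uses all seven and yields a higher total than A and B together, which a greedy "largest value first" pick would not necessarily find. The table costs time and memory proportional to the number of jobs times the number of spare nodes.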
