Journal: International Journal of Parallel Programming

A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of the existing techniques deteriorates rapidly. In this context, we propose a message-passing-based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware, such as the job scheduler, to provide abnormal information. We present the details of the MPFL framework, including the implementation of TFD and TFA, and we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.