A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Jian Gao; Hongmei Wei; Kang Yu; Peng Qing

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been used for HPC systems; however, as these systems scale out, the effectiveness of the existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to provide information about abnormal conditions. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes; it demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
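
The abstract does not spell out how TFD works, so the following is only a generic illustration of the idea behind tree-based fault detection, not the authors' algorithm. It is a single-process simulation that assumes ranks are laid out as a k-ary tree and that a hypothetical `alive` map stands in for a real liveness probe such as a heartbeat timeout. The names `children`, `detect_faults`, and `localize` are invented for this sketch.

```python
from typing import Dict, List, Set


def children(rank: int, size: int, fanout: int = 2) -> List[int]:
    """Ranks of the children of `rank` in a k-ary tree over ranks 0..size-1."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def detect_faults(rank: int, size: int, alive: Dict[int, bool],
                  fanout: int = 2) -> Set[int]:
    """Collect unresponsive ranks in the subtree below `rank`.

    `alive` is an assumed stand-in for a liveness probe. The simulation
    descends past a dead child so that faults further down are not hidden;
    a real distributed service would need a fallback path for that case.
    """
    failed: Set[int] = set()
    for child in children(rank, size, fanout):
        if not alive.get(child, False):
            failed.add(child)
        failed |= detect_faults(child, size, alive, fanout)
    return failed


def localize(size: int, alive: Dict[int, bool], fanout: int = 2) -> List[int]:
    """Return the sorted list of failed ranks as seen from the root (rank 0)."""
    failed = detect_faults(0, size, alive, fanout)
    if not alive.get(0, False):
        failed.add(0)
    return sorted(failed)


if __name__ == "__main__":
    # Ten ranks, mirroring the 10-node cluster size mentioned in the abstract.
    status = {r: True for r in range(10)}
    status[4] = False
    status[7] = False
    print(localize(10, status))
```

The appeal of a tree layout is that each node probes only a bounded number of children, so the per-node monitoring cost stays constant as the job grows, while reports reach the root in a number of hops logarithmic in the node count.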

Bibliographic details

Source
International Journal of Parallel Programming | 2018, No. 4 | pp. 749-761 | 13 pages
Authors
Jian Gao; Hongmei Wei; Kang Yu; Peng Qing
Affiliations
Jiangnan Institute of Computing Technology (all four authors)
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
High-performance computing; Fault localization; Message-passing; Distributed

Similar documents

1. Fault-Aware Runtime Strategies for High-Performance Computing [J]. Yawei Li, Zhiling Lan, Gujrati P. Parallel and Distributed Systems, IEEE Transactions on. 2009, No. 4
2. The MegaM@Rt2 ECSEL project: MegaModelling at Runtime - Scalable model-based framework for continuous development and runtime validation of complex systems [J]. Afzal Wasif, Bruneliere Hugo, Di Ruscio Davide. Microprocessors and microsystems. 2018, No. SEPa
3. Autonomic Runtime Adaptation Framework for Power Management in Large-Scale High-Performance Computing Systems [C]. Sumit Kumar Saurav, S Bindhumadhva Bapu. IEEE India Council International Conference. 2020
4. Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters [D]. Raja Chandrasekar, Raghunath. 2014
5. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo3 Framework [O]. Alfonso Rodríguez, Juan Valverde, Jorge Portilla. 2018
6. High-Performance Computing Framework Based on Distributed Systems for Large-Scale Neurophysiological Data [O]. Mohsen Hadianpour, Ehsan Rezayat, Mohammad-Reza Dehaqani. 2021
