Conference: IEEE Information Technology, Networking, Electronic and Automation Control Conference

A scalable runtime fault detection mechanism for high performance computing


Abstract

Fault detection is the process of deducing the exact source of an application failure from a set of observed symptoms. It has become increasingly challenging for high performance computing (HPC) applications that use the Message Passing Interface (MPI). Various runtime fault detection tools, such as Marmot, Umpire and ISP, have been used for MPI applications. However, as MPI applications scale out, their complexity increases proportionally, and existing runtime fault detection techniques deteriorate rapidly. In this context, we propose a scalable runtime fault detection mechanism, SRFD, which provides a distributed, lightweight service using tree-based fault detection algorithms at runtime. In essence, SRFD serves as a fault detection engine within message-passing libraries: it logically organizes all application processes into a tree topology and uses fault report and analysis algorithms designed for that purpose. We present the details of the SRFD mechanism, including the implementation of the fault report and analysis algorithms. Further, we develop a prototype of the fault detection engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 24 computing nodes and demonstrates the capability of SRFD by detecting common faults such as deadlocks, invalid arguments and type-matching errors.
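The abstract does not specify the shape of the tree or the report format, so the following is only a minimal sketch of the general idea of organizing process ranks into a logical tree and forwarding fault reports toward the root. It assumes a k-ary tree rooted at rank 0; the function names (`parent`, `children`, `report_path`, `aggregate`) and the fanout parameter are hypothetical and are not taken from SRFD or MVAPICH2. No MPI calls are used; ranks are plain integers.

```python
from __future__ import annotations


def parent(rank: int, fanout: int = 2) -> int | None:
    """Parent of `rank` in an assumed k-ary tree rooted at rank 0."""
    return None if rank == 0 else (rank - 1) // fanout


def children(rank: int, size: int, fanout: int = 2) -> list[int]:
    """Children of `rank` among `size` processes in the same assumed layout."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def report_path(rank: int, fanout: int = 2) -> list[int]:
    """Ranks a fault report would pass through on its way to the root."""
    path = [rank]
    while (p := parent(path[-1], fanout)) is not None:
        path.append(p)
    return path


def aggregate(size: int, local_faults: dict[int, list[str]],
              fanout: int = 2) -> list[tuple[int, str]]:
    """Merge per-rank fault lists bottom-up and return what the root collects.

    Each entry is (originating rank, fault description). Children always
    have a higher rank than their parent in this layout, so visiting ranks
    in descending order merges every subtree before it is forwarded.
    """
    merged = {r: [(r, f) for f in local_faults.get(r, [])] for r in range(size)}
    for rank in range(size - 1, 0, -1):
        merged[parent(rank, fanout)].extend(merged[rank])
    return merged[0]


if __name__ == "__main__":
    faults = {5: ["deadlock"], 3: ["invalid argument"]}
    print(report_path(5))
    print(sorted(aggregate(7, faults)))
```

With a fanout of k, a report from any of n ranks reaches the root in roughly log_k(n) hops, which is one plausible reason a tree layout scales better than having every process report to a single collector.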
