Journal: ETRI Journal

Scalable Approach to Failure Analysis of High-Performance Computing Systems

Abstract

Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
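
To make the role of rough set theory more concrete, the following is a minimal sketch of one standard RST quantity, the dependency degree of a decision attribute on a set of condition attributes, computed from lower approximations. It is not the authors' implementation. The records, attribute names (`workload`, `node_type`, `root_cause`), and values are hypothetical toy data chosen for illustration, not taken from the failure data set the abstract refers to.

```python
from collections import defaultdict

def partition(records, attrs):
    """Group record indices into indiscernibility classes over attrs."""
    blocks = defaultdict(set)
    for i, rec in enumerate(records):
        blocks[tuple(rec[a] for a in attrs)].add(i)
    return list(blocks.values())

def lower_approximation(records, attrs, target):
    """Records whose indiscernibility class lies entirely inside target."""
    lower = set()
    for block in partition(records, attrs):
        if block <= target:
            lower |= block
    return lower

def upper_approximation(records, attrs, target):
    """Records whose indiscernibility class overlaps target at all."""
    upper = set()
    for block in partition(records, attrs):
        if block & target:
            upper |= block
    return upper

def dependency_degree(records, cond_attrs, decision_attr):
    """Fraction of records classified unambiguously by cond_attrs."""
    positive = set()
    for decision_class in partition(records, [decision_attr]):
        positive |= lower_approximation(records, cond_attrs, decision_class)
    return len(positive) / len(records)

# Hypothetical toy failure records (illustrative only).
# Records 0 and 1 are inconsistent: same conditions, different root cause.
records = [
    {"workload": "compute",  "node_type": "A", "root_cause": "hardware"},
    {"workload": "compute",  "node_type": "A", "root_cause": "software"},
    {"workload": "graphics", "node_type": "A", "root_cause": "hardware"},
    {"workload": "graphics", "node_type": "B", "root_cause": "software"},
    {"workload": "compute",  "node_type": "B", "root_cause": "software"},
    {"workload": "frontend", "node_type": "B", "root_cause": "network"},
]

both = dependency_degree(records, ["workload", "node_type"], "root_cause")
workload_only = dependency_degree(records, ["workload"], "root_cause")
node_only = dependency_degree(records, ["node_type"], "root_cause")
print(both, workload_only, node_only)
```

In this toy example, the two condition attributes together classify four of the six records unambiguously, while either attribute alone does much worse. Comparing dependency degrees across attribute subsets in this way is one common route in RST to identifying which attributes matter for a decision such as root cause; inconsistent records like the first two fall outside the lower approximation but inside the upper one.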
