Journal: Future Generation Computer Systems

A machine learning approach to online fault classification in HPC systems

Abstract

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates at both the hardware and software levels will increase significantly. Detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they turn into failures, therefore becomes essential for continued operation. Central to this objective is fault injection, the deliberate triggering of faults in a system in order to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach is that it operates on streamed data in an online manner, which opens the possibility of devising and enacting control actions on the target system in real time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. To train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ and generate a fault dataset, which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach reaches almost perfect classification accuracy for different fault types with low computational overhead and minimal delay.