Journal: IEEE Transactions on Parallel and Distributed Systems

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Abstract

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
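
The pipeline outlined in the abstract (time series to statistical features, then signatures, then runtime diagnosis) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the features, the learning algorithm, or the metrics, so everything here is an assumption made for illustration. The metric names (`cpu_util`, `mem_used`), the anomaly labels (`cpu_hog`, `mem_leak`), the synthetic data generator, the choice of mean, standard deviation, minimum, and maximum as features, and the nearest-centroid classifier standing in for the "signatures" are all hypothetical. The feature selection step mentioned in the abstract is omitted, and features are left unscaled for brevity.

```python
import random
import statistics
from collections import defaultdict


def extract_features(window):
    """Summarize one window of samples for a single metric as simple statistics."""
    return [
        statistics.fmean(window),
        statistics.pstdev(window),
        min(window),
        max(window),
    ]


def featurize(run):
    """Concatenate per-metric statistics for a run given as {metric_name: samples}."""
    features = []
    for name in sorted(run):  # fixed metric order keeps vectors comparable
        features.extend(extract_features(run[name]))
    return features


def learn_signatures(labeled_runs):
    """Average the feature vectors of each label; the centroid is its assumed 'signature'."""
    grouped = defaultdict(list)
    for run, label in labeled_runs:
        grouped[label].append(featurize(run))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def diagnose(run, signatures):
    """Return the label whose signature is closest (squared Euclidean) to the run."""
    vector = featurize(run)

    def distance(signature):
        return sum((a - b) ** 2 for a, b in zip(vector, signature))

    return min(signatures, key=lambda label: distance(signatures[label]))


def synthetic_run(rng, label, n=60):
    """Hypothetical data: 'cpu_hog' raises CPU usage, 'mem_leak' ramps memory usage."""
    cpu = [rng.gauss(30, 3) for _ in range(n)]
    mem = [rng.gauss(40, 2) for _ in range(n)]
    if label == "cpu_hog":
        cpu = [c + 50 for c in cpu]
    elif label == "mem_leak":
        mem = [m + 0.8 * i for i, m in enumerate(mem)]
    return {"cpu_util": cpu, "mem_used": mem}


# Offline phase: learn one signature per label from historical (here synthetic) runs.
rng = random.Random(0)
labels = ["healthy", "cpu_hog", "mem_leak"]
training = [(synthetic_run(rng, label), label) for label in labels for _ in range(20)]
signatures = learn_signatures(training)

# Runtime phase: diagnose previously unseen runs with a cheap nearest-signature lookup.
predictions = [diagnose(synthetic_run(rng, label), signatures) for label in labels]
print(predictions)
```

The runtime step only computes a handful of statistics and a few distance comparisons per window, which is one plausible reading of how diagnosis with "negligible overhead" could work; the framework in the paper may differ substantially.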