Journal: IEEE Transactions on Parallel and Distributed Systems

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning



Abstract

As the size and complexity of high-performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
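The pipeline the abstract outlines (time series to statistical features, signatures from labeled history, runtime diagnosis) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not name the features, the feature-selection step, or the learning model. The sketch assumes four simple statistics per window, skips feature selection, and uses a nearest-centroid match as a stand-in for whatever model the authors train. All function names, labels, and sample values are hypothetical.

```python
import math
import statistics


def window_features(samples):
    """Summarize one window of a single metric as a fixed-length feature vector."""
    return [
        min(samples),
        max(samples),
        statistics.fmean(samples),
        statistics.pstdev(samples),
    ]


def sliding_features(series, window, step):
    """Slide a window over the series and featurize each full window."""
    return [
        window_features(series[start:start + window])
        for start in range(0, len(series) - window + 1, step)
    ]


def build_signatures(labeled_windows):
    """Average the feature vectors of each label into one signature per label.

    Assumption: a centroid stands in for the learned signature.
    """
    grouped = {}
    for label, samples in labeled_windows:
        grouped.setdefault(label, []).append(window_features(samples))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def diagnose(samples, signatures):
    """Return the label whose signature is closest in Euclidean distance."""
    features = window_features(samples)
    return min(signatures, key=lambda label: math.dist(features, signatures[label]))


# Hypothetical CPU-utilization windows (percent); labels are made up for illustration.
history = [
    ("healthy", [20, 22, 21, 19, 20]),
    ("healthy", [18, 21, 20, 22, 19]),
    ("cpu_contention", [85, 90, 88, 92, 87]),
    ("cpu_contention", [80, 95, 89, 91, 86]),
]
signatures = build_signatures(history)
print(diagnose([84, 91, 90, 88, 86], signatures))
```

Because each window reduces to a handful of cheap statistics and a distance comparison, the runtime step stays lightweight, which is in the spirit of the "negligible overhead" claim, though the actual cost depends on the features and model the authors use.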

