Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Tuncer Ozan; Ates Emre; Zhang Yijia; Turk Ata; Brandt Jim; Leung Vitus J.; Egele Manuel; Coskun Ayse K.

Abstract

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
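
The pipeline outlined in the abstract (time series windows reduced to statistical features, signatures learned from labeled history, cheap matching at runtime) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the four statistics, the synthetic CPU and memory metrics, the anomaly labels, and the nearest-signature rule, which stands in for whatever learner the paper actually uses. The feature selection step mentioned in the abstract is omitted.

```python
import random
import statistics


def extract_features(series):
    """Reduce one metric's time series window to a few cheap statistics."""
    return [
        statistics.fmean(series),
        statistics.pstdev(series),
        min(series),
        max(series),
    ]


def window_features(window):
    """Concatenate the per-metric statistics for a window of several metrics."""
    feats = []
    for series in window:
        feats.extend(extract_features(series))
    return feats


def learn_signatures(labeled_windows):
    """Average the feature vectors of each label into one signature per label."""
    grouped = {}
    for label, window in labeled_windows:
        grouped.setdefault(label, []).append(window_features(window))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def diagnose(window, signatures):
    """Return the label whose signature is closest in squared Euclidean distance."""
    feats = window_features(window)

    def distance(signature):
        return sum((a - b) ** 2 for a, b in zip(feats, signature))

    return min(signatures, key=lambda label: distance(signatures[label]))


def synthetic_window(rng, cpu_mean, mem_mean, length=30):
    """Hypothetical monitoring data: noisy CPU and memory utilization samples."""
    cpu = [rng.gauss(cpu_mean, 2.0) for _ in range(length)]
    mem = [rng.gauss(mem_mean, 2.0) for _ in range(length)]
    return [cpu, mem]


# Hypothetical labels and (CPU, memory) utilization levels, not taken from the paper.
profiles = {
    "healthy": (50, 40),
    "cpu_contention": (95, 40),
    "memory_leak": (50, 90),
}

rng = random.Random(0)
training = [
    (label, synthetic_window(rng, *levels))
    for label, levels in profiles.items()
    for _ in range(20)
]
signatures = learn_signatures(training)

# Runtime diagnosis of a previously unseen window.
print(diagnose(synthetic_window(random.Random(1), 95, 40), signatures))
```

Because the signatures are computed once from historical data, the runtime step in this sketch costs only one feature extraction and a handful of distance computations per window, which is the kind of low-overhead diagnosis the abstract describes.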


Bibliographic record

Source
IEEE Transactions on Parallel and Distributed Systems, 2019, issue 4, pp. 883-896 (14 pages)
Authors
Tuncer Ozan; Ates Emre; Zhang Yijia; Turk Ata; Brandt Jim; Leung Vitus J.; Egele Manuel; Coskun Ayse K.
Author affiliations

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Sandia Natl Labs, POB 5800, Albuquerque, NM 87185 USA;

Sandia Natl Labs, POB 5800, Albuquerque, NM 87185 USA;

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA;

Original format: PDF
Language: English
Keywords
High performance computing; anomaly detection; machine learning; performance variation

Similar literature

1. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning [J]. Tuncer Ozan, Ates Emre, Zhang Yijia. IEEE Transactions on Parallel and Distributed Systems, 2019, issue 4.
2. A machine learning approach to online fault classification in HPC systems [J]. Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu. Future Generation Computer Systems, 2020, issue Sepa.
3. An Approach to Develop Expert Systems in Medical Diagnosis Using Machine Learning Algorithms (Asthma) and A Performance Study [J]. BDCN Prasadl, P. E. S. N Krishna Prasad, Y Sagar. International Journal on Soft Computing, 2011, issue 1.
4. Diagnosing Performance Variations in HPC Applications Using Machine Learning [C]. Ozan Tuncer, Emre Ates, Yijia Zhang. International Conference on High Performance Computing, 2017.
5. HPC and Machine Learning Techniques for Reducing the Computation Burden of Determining Time-Evolution of Complex Dynamic Systems [D]. Lakshmiranganatha, Sumathi. 2021.
6. ERD-Based Online Brain–Machine Interfaces (BMI) in the Context of Neurorehabilitation: Optimizing BMI Learning and Performance [O]. Surjo R. Soekadar, Matthias Witkowski, Jürgen Mellinger.
7. A machine learning approach to online fault classification in HPC systems [O]. Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu. 2020.
