Methodologies for advance warning of compute cluster problems via statistical analysis

Abstract

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and the efficiency of platforms. In this paper we present our findings and the methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort, we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that, for a particular error case, statistical analysis using OVIS would enable advance warning of cluster problems on timescales that would allow applications and system administrators to respond before errors, subsequent system error log reporting, and job failures occur. This is significant because the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.
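
The abstract does not say which statistics are used, so the following is only a minimal sketch of one generic approach to the idea: comparing each node's reading of a monitored metric against the distribution of the same metric on the other nodes and flagging strong deviations. The function name `flag_outlier_nodes`, the node names, the temperature-like readings, and the threshold of 3 standard deviations are all assumptions made for illustration. None of this is OVIS's API or the authors' actual method.

```python
from statistics import mean, stdev


def flag_outlier_nodes(readings, threshold=3.0):
    """Return the nodes whose reading deviates strongly from the other nodes.

    `readings` maps a (hypothetical) node name to one sampled metric value.
    Each node is compared against the mean and sample standard deviation of
    all *other* nodes, so a single extreme node cannot mask itself.
    """
    flagged = []
    for node, value in readings.items():
        others = [v for n, v in readings.items() if n != node]
        if len(others) < 2:
            # Not enough peers to estimate a spread; skip this node.
            continue
        mu = mean(others)
        sigma = stdev(others)
        if sigma == 0:
            # All peers agree exactly; any difference is a deviation.
            if value != mu:
                flagged.append(node)
            continue
        if abs(value - mu) / sigma > threshold:
            flagged.append(node)
    return flagged


# Illustrative (made-up) readings: four similar nodes and one that stands out.
sample = {"n1": 40.0, "n2": 41.0, "n3": 39.5, "n4": 40.5, "n5": 75.0}
print(flag_outlier_nodes(sample))
```

In this sketch a flagged node is only a candidate for closer inspection. Whether such a deviation actually precedes a failure, and by how much time, is the kind of question the paper investigates.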

Bibliographic record

Source: Proceedings of the 2009 workshop on Resiliency in high performance, 2009, pp. 7-14 (8 pages)
Conference location: Garching (DE)
Authors: Jim Brandt; Ann Gentile; Jackson Mayo; Philippe Pebay; Diana Roe; David Thompson; Matthew Wong
Affiliation: Sandia National Laboratories (listed for all seven authors)
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: failure prediction; fault tolerance; RAS; reliability
