Proceedings of the 2009 workshop on Resiliency in high performance

Methodologies for advance warning of compute cluster problems via statistical analysis

Abstract

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and the efficiency of platforms. In this paper we present our findings and the methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that, for a particular error case, statistical analysis using OVIS would enable advance warning of cluster problems on timescales that would allow applications and system administrators to respond before errors occur, before the system error logs report them, and before jobs fail. This is significant because the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.
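
The abstract does not state which statistical tests are used, so the following is only a minimal sketch of one common approach to this kind of advance warning: compare each node's reading for a single metric against the rest of the cluster and flag nodes that deviate strongly. It is not the OVIS API and not necessarily the authors' method. The function name `flag_outlier_nodes`, the node names, the sample values, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def flag_outlier_nodes(readings, threshold=3.0):
    """Return {node: z_score} for nodes whose reading is far from the cluster mean.

    readings  -- mapping of node name to one numeric sample of a metric
                 (hypothetical input format, not an OVIS data structure)
    threshold -- number of population standard deviations treated as anomalous
                 (assumed value; a real deployment would tune this)
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All nodes report the same value, so nothing stands out.
        return {}
    flagged = {}
    for node, value in readings.items():
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged[node] = z
    return flagged


if __name__ == "__main__":
    # Made-up sample: 19 nodes reporting similar values and one node far above them.
    sample = {f"node{i:03d}": 40.0 + (i % 3) for i in range(19)}
    sample["node019"] = 75.0
    for node, z in flag_outlier_nodes(sample).items():
        print(f"{node}: z-score {z:.2f}")
```

A check like this would have to run repeatedly over time, and how early it fires relative to the actual failure is exactly the lead-time question the abstract raises.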
