Journal of Parallel and Distributed Computing

Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis

Abstract

Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect failure-inducing errors and to diagnose system failures, which result from error propagation and the (unsuccessful) execution of error recovery mechanisms. For more accurate and detailed failure diagnosis, knowledge of error propagation patterns and unsuccessful error recovery is important. To improve system reliability, knowledge of recovery protocol deployment is important. This paper describes and demonstrates the application of a new diagnostics framework (CORRMEXT). CORRMEXT analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The steps in the framework are the correlation of resource use metrics with error messages and the identification of the earliest times of change in system behaviour. The framework is illustrated with analyses of resource use data and message logs for three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.
(C) 2019 Elsevier Inc. All rights reserved.
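
The two framework steps named in the abstract (correlating resource use metrics with error messages, then finding the earliest time of change) can be illustrated with a minimal sketch. This is not CORRMEXT's code: the abstract does not name its correlation algorithms or its change-detection method, so Pearson and Spearman correlation and a simple baseline-threshold rule are assumptions chosen for illustration. The series `mem_used` and `error_count`, and all function names, are hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (var_x * var_y) ** 0.5


def ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def earliest_change(series, baseline_len, threshold=3.0):
    """Index of the first sample after the baseline window that deviates
    from the baseline mean by more than `threshold` baseline standard
    deviations, or None if no sample does (an assumed, simplistic rule)."""
    base = series[:baseline_len]
    mu, sigma = mean(base), pstdev(base)
    limit = threshold * max(sigma, 1e-9)  # avoid a zero-width band
    for i in range(baseline_len, len(series)):
        if abs(series[i] - mu) > limit:
            return i
    return None


# Hypothetical per-interval samples for one node (not data from the paper).
mem_used = [10, 11, 10, 12, 11, 30, 45, 60, 80, 95]   # resource use counter
error_count = [0, 0, 0, 0, 0, 0, 2, 5, 9, 14]         # logged error messages

r_pearson = pearson(mem_used, error_count)
r_spearman = spearman(mem_used, error_count)
change_resource = earliest_change(mem_used, baseline_len=5)
change_errors = earliest_change(error_count, baseline_len=5)

print(r_pearson, r_spearman, change_resource, change_errors)
```

In this made-up data the resource counter leaves its baseline one interval before the first error is logged, which is the kind of situation where looking at both correlated series, rather than the error log alone, gives an earlier time of change.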
