Journal of Parallel and Distributed Computing

Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis

Abstract

Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect failure-inducing errors and to diagnose system failures, which result from error propagation and the (unsuccessful) execution of error recovery mechanisms. More accurate and detailed failure diagnosis requires knowledge of error propagation patterns and of unsuccessful error recovery; improving system reliability requires knowledge of how recovery protocols are deployed. This paper describes and demonstrates the application of a new diagnostics framework, CORRMEXT. CORRMEXT analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The framework has two steps: correlating resource use metrics with error messages, and identifying the earliest times at which system behaviour changes. The framework is illustrated with analyses of resource use data and message logs for three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.
(C) 2019 Elsevier Inc. All rights reserved.
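
The two steps named in the abstract (correlating resource use counters with error counts, then locating the earliest time at which behaviour changes) can be illustrated with a small sketch. This is not CORRMEXT's implementation: the abstract does not name its correlation algorithms or its change-detection method, so the use of Pearson and Spearman coefficients, the baseline-deviation rule, the function names, and the sample data are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson (linear) correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant series has no defined correlation; report 0
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman (rank) correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def earliest_change(series, baseline_len, k=3.0):
    """Index of the first sample after the baseline window that deviates
    from the baseline mean by more than k baseline standard deviations.
    Returns None if no such sample exists. (Assumed rule, not from the paper.)"""
    base = series[:baseline_len]
    mu, sigma = mean(base), pstdev(base)
    for i in range(baseline_len, len(series)):
        if abs(series[i] - mu) > k * sigma:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical per-interval samples: one resource use counter and
    # the number of error messages logged in the same interval.
    counter = [10, 11, 10, 12, 25, 40, 55, 70]
    errors = [0, 0, 0, 0, 0, 3, 6, 9]

    print("pearson :", round(pearson(counter, errors), 3))
    print("spearman:", round(spearman(counter, errors), 3))

    # Look at both series and keep the earlier change time.
    changes = [earliest_change(s, baseline_len=4) for s in (counter, errors)]
    print("earliest change index:", min(c for c in changes if c is not None))
```

In this made-up data the counter departs from its baseline one interval before the first error message appears, so examining only the error log would place the change later than examining both series, which is the kind of situation the abstract's second finding refers to.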
