Conference: 2017 IEEE 24th International Conference on High Performance Computing
Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis

Abstract

Recent work has used failure logs and resource use data, both separately and together, to detect system failure-inducing errors and to diagnose system failures. System failure occurs as a result of error propagation and the unsuccessful execution of error recovery mechanisms. Knowledge of error propagation patterns and unsuccessful error recovery is important for more accurate and detailed failure diagnosis, and knowledge of recovery protocol deployment is important for improving system reliability. This paper presents the CORRMEXT framework, which carries failure diagnosis another significant step forward by analyzing and reporting error propagation patterns and the degrees of success and failure of error recovery protocols. CORRMEXT uses both error messages and resource use data in its analyses. Application of CORRMEXT to data from the Ranger supercomputer has produced new insights. CORRMEXT has: (i) identified correlations between resource use counters that capture recovery attempts after an error, (ii) identified correlations between error events to capture error propagation patterns within the system, (iii) identified error propagation and recovery paths during system execution to explain system behaviour, and (iv) shown that the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. CORRMEXT will be installed on the HPC clusters at the Texas Advanced Computing Center in Autumn 2017.
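
The abstract does not say how CORRMEXT computes its correlations, so the following is only a minimal sketch of the general idea behind item (i): finding resource use counters that move together over time. It assumes Pearson correlation and a fixed threshold, neither of which is confirmed by the source. The counter names, the sample values, and the `correlated_pairs` helper are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(counters, threshold=0.9):
    """Return (name_a, name_b, r) for counter pairs with |r| >= threshold.

    The threshold is an assumption made for illustration only.
    """
    pairs = []
    for a, b in combinations(sorted(counters), 2):
        r = pearson(counters[a], counters[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical counter samples, one value per time window (made-up numbers,
# not data from the Ranger supercomputer).
samples = {
    "io_wait": [1, 2, 3, 4, 5],
    "net_retries": [2, 4, 6, 8, 10],
    "cpu_user": [5, 1, 4, 2, 3],
}
result = correlated_pairs(samples)
```

With these made-up samples, only the `io_wait` and `net_retries` pair passes the threshold. The same pairwise scan could in principle be applied to per-window counts of error events, as in item (ii), but that is also an assumption rather than something the abstract states.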
