Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis

Abstract

Recent work has used failure logs and resource use data, both separately and together, to detect system failure-inducing errors and to diagnose system failures. System failure occurs as a result of error propagation and the (unsuccessful) execution of error recovery mechanisms. Knowledge of error propagation patterns and unsuccessful error recovery is important for more accurate and detailed failure diagnosis, and knowledge of recovery protocol deployment is important for improving system reliability. This paper presents the CORRMEXT framework, which carries failure diagnosis another significant step forward by analyzing and reporting error propagation patterns and the degrees of success and failure of error recovery protocols. CORRMEXT uses both error messages and resource use data in its analyses. Application of CORRMEXT to data from the Ranger supercomputer has produced new insights. CORRMEXT has: (i) identified correlations between resource use counters that capture recovery attempts after an error, (ii) identified correlations between error events to capture error propagation patterns within the system, (iii) identified error propagation and recovery paths during system execution to explain system behaviour, and (iv) shown that the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. CORRMEXT will be installed on the HPC clusters at the Texas Advanced Computing Center in Autumn 2017.
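
As a rough illustration of what "identifying correlations between resource use counters" can look like, the sketch below computes pairwise Pearson correlation over counter time series and reports strongly correlated pairs. This is not the CORRMEXT implementation: the abstract does not say which correlation measure the framework uses, so the choice of Pearson correlation, the 0.9 threshold, the function names, and the counter names and values are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no correlation information.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(series, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `series` maps a counter name to its samples; all counters are assumed
    to be sampled at the same time points.
    """
    pairs = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical counters, invented for illustration only.
counters = {
    "mem_used": [10, 12, 15, 30, 55, 80],
    "net_retries": [0, 1, 1, 4, 9, 14],
    "cpu_idle": [50, 52, 49, 51, 50, 48],
}

for name_a, name_b, r in correlated_pairs(counters):
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With these invented samples, only the mem_used / net_retries pair passes the threshold. The same pairwise approach could in principle be applied to per-interval counts of error events, which is the kind of input item (ii) of the abstract refers to.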

Bibliographic record

Source
2017 IEEE 24th International Conference on High Performance Computing | 2017 | pp. 317-327 | 11 pages
Conference location: Jaipur (IN)
Authors
Edward Chuah; Arshad Jhumka; Samantha Alt; Theo Damoulas; Nentawe Gurumdimma; Marie-Christine Sawley; William L. Barth; Tommy Minyard; James C. Browne
Original format: PDF
Language: English (eng)
Keywords
Correlation; Data mining; Micromechanical devices; Electronic mail; Monitoring; Protocols; Tools

Similar literature

1. Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis [J]. Chuah Edward, Jhumka Arshad, Alt Samantha. Journal of Parallel and Distributed Computing. 2019, issue OCTa
2. Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis [J]. Chuah Edward, Jhumka Arshad, Alt Samantha. Journal of Parallel and Distributed Computing. 2019, issue Octa
3. Crocus: Enabling Computing Resource Orchestration for Inline Cluster-Wide Deduplication on Scalable Storage Systems [J]. Hamandawana Prince, Khan Awais, Lee Chang-Gyu. IEEE Transactions on Parallel and Distributed Systems. 2020, No. 8
4. Enabling Dependability-Driven Resource Use and Message Log-Analysis for Cluster System Diagnosis [C]. Edward Chuah, Arshad Jhumka, Samantha Alt. IEEE International Conference on High Performance Computing. 2017
5. Extensible message layers for resource-rich cluster computers [D]. Ulmer, Craig Douglas. 2002
6. A Smartphone Crowdsensing System Enabling Environmental Crowdsourcing for Municipality Resource Allocation with LSTM Stochastic Prediction [O]. Theodoros Anagnostopoulos, Theodoros Xanthopoulos, Yannis Psaromiligkos. 2020
7. Using message logs and resource use data for cluster failure diagnosis [O]. Chuah Edward, Jhumka Arshad, Browne James C. 2017
