Journal: Journal of Parallel and Distributed Computing

Quantifying event correlations for proactive failure management in networked computing systems

Abstract

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures are the norm rather than the exception. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take application allocation information into account, which uncovers more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show that the offline and online predictions made by our prediction system can forecast 72.7-85.3% of failure occurrences and capture failure correlations in a cluster coalition environment.
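
The abstract does not give the form of the temporal model, so the following is only a minimal sketch of the general idea, assuming the standard spherical covariance function from geostatistics, normalised to 1 at zero lag. The names `spherical_covariance`, `cluster_by_time`, `theta` (the adjustable timescale) and `threshold` are hypothetical and chosen for illustration; the paper's actual coefficients and clustering procedure may differ.

```python
from typing import List, Sequence


def spherical_covariance(d: float, theta: float) -> float:
    """Standard spherical covariance, normalised so that C(0) = 1.

    d     -- time distance between two failure events
    theta -- adjustable timescale; events farther apart than theta
             are treated as uncorrelated (assumed textbook form)
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    d = abs(d)
    if d >= theta:
        return 0.0
    r = d / theta
    return 1.0 - 1.5 * r + 0.5 * r ** 3


def cluster_by_time(timestamps: Sequence[float],
                    theta: float,
                    threshold: float) -> List[List[float]]:
    """Illustrative grouping only, not the paper's algorithm.

    An event joins the current cluster when its correlation with the
    previous event is at least `threshold`; otherwise it starts a
    new cluster.
    """
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and spherical_covariance(t - clusters[-1][-1], theta) >= threshold:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


if __name__ == "__main__":
    # Hypothetical failure times, in arbitrary time units.
    print(cluster_by_time([0, 1, 2, 50, 51], theta=10, threshold=0.5))
```

A larger `theta` makes temporally distant failures count as correlated, so clusters merge; a smaller one splits them. This is the sense in which the timescale parameter is adjustable.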