Journal: Performance Evaluation Review

Co-designing the Failure Analysis and Monitoring of Large-Scale Systems

Abstract

Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes at different geographical locations, connected via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, which cause service downtime and hence high monetary and business costs. In this paper, we argue that to understand failures in such a system, the monitoring system must be co-designed with the failure analysis system. Whereas existing monitoring systems are not designed specifically for failure analysis, we advocate designing a monitoring system with the explicit goal of uncovering the causes of failures. Similarly, to serve as an effective tool, failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation. Towards this end, we discuss some guiding principles for the co-design of monitoring and failure analysis systems for planetary-scale systems.