A fault detection service for wide area distributed computations

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
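
The abstract does not give implementation details, so the following is only a minimal sketch, assuming a simple heartbeat-and-timeout scheme, of how an unreliable fault detector can let a user trade timeliness of reporting against false positives. The class name `HeartbeatDetector`, its `timeout` parameter, and the component names are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Illustrative unreliable fault detector (hypothetical, not the paper's design).

    A component is suspected once it has been silent for longer than
    `timeout` seconds. A suspicion may be wrong: a slow network can
    delay a heartbeat from a component that is still alive.
    """
    timeout: float  # assumed tuning knob: silence tolerated before suspicion
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspects(self, now: float) -> List[str]:
        # Report every component whose silence exceeds the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Two detectors watch the same components with different timeouts.
fast = HeartbeatDetector(timeout=2.0)
slow = HeartbeatDetector(timeout=10.0)

for detector in (fast, slow):
    detector.heartbeat("worker-a", now=0.0)
    detector.heartbeat("worker-b", now=0.0)

# Scenario: worker-b crashes right after t=0, while worker-a stays alive
# but its next heartbeat is delayed and only arrives at t=4.
early_fast = fast.suspects(now=3.0)   # includes worker-a: a false positive
early_slow = slow.suspects(now=3.0)   # nothing reported yet

for detector in (fast, slow):
    detector.heartbeat("worker-a", now=4.0)

later_fast = fast.suspects(now=4.5)   # the false suspicion is withdrawn
late_slow = slow.suspects(now=12.0)   # the crash is reported, but much later
```

In this scenario the short timeout reports the crashed component quickly but briefly misreports a live one, while the long timeout avoids the false positive at the cost of a later report.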
