A Dynamic and Reliable Failure Detection and Failure Recovery Services in the Grid Systems

Abstract

Fault tolerance and resource monitoring are important services in grid computing systems, which are composed of heterogeneous and geographically distributed resources. Reliability and performance must be treated as major criteria when executing safety-critical applications in grid systems. Since the failure of a resource can lead to job execution failure, a fault tolerance service is essential to achieve dependability in grid systems. This paper proposes a fault tolerance and resource monitoring service that improves dependability with respect to economic efficiency. The dynamic architecture of this method reduces resource consumption, performance overhead, and network traffic. The proposed fault tolerance service consists of failure detection and failure recovery. A two-layered detection service is proposed to improve failure coverage and reduce the probability of false alarms. An application-level checkpointing technique with an appropriate grain size is proposed as the recovery service to attain a tradeoff between failure detection latency and performance overhead. An analytical approach is used to analyze the reliability and efficiency of the proposed fault tolerance services.
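
The abstract does not give implementation details, so the following is only a minimal sketch of what application-level checkpointing with a tunable grain size can look like, not the paper's method. The function name `run_with_checkpoints`, the `grain` and `fail_at` parameters, and the JSON checkpoint file are all hypothetical choices made for illustration.

```python
import json
import os


def run_with_checkpoints(total_steps, grain, path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving state every `grain` steps.

    `fail_at` is a hypothetical hook that simulates a resource failure
    just before the given step is executed.
    """
    # Recovery: resume from the last saved checkpoint if one exists.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated resource failure")

        # One unit of application work.
        state["acc"] += state["step"]
        state["step"] += 1

        # Checkpoint at the chosen grain size (assumed to be a step count).
        if state["step"] % grain == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write cannot leave a half-written checkpoint behind.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)

    return state["acc"]
```

In this sketch a smaller `grain` writes checkpoints more often (more overhead during normal execution) but loses less work after a failure, while a larger `grain` does the opposite. Calling the function again with the same `path` after a failure resumes from the last checkpoint rather than from step 0.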
