Software fault tolerance in a clustered architecture: techniques and reliability modeling

Abstract

System architectures based on a cluster of computers have gained substantial attention recently. In a clustered system, complex software-intensive applications can be built with commercial hardware, operating systems, and application software to achieve high system availability and data integrity, while performance and cost penalties are greatly reduced by the use of separate error detection hardware and dedicated software fault tolerance routines. Within such a system a watchdog provides mechanisms for error detection and switch-over to a spare or backup processor in the presence of processor failures. The application software is responsible for the extent of the error detection, subsequent recovery actions and data backup. The application can be made as reliable as the user requires, being constrained only by the upper bounds on reliability imposed by the clustered architecture under various implementation schemes. We present reliability modeling and analysis of the clustered system by defining the hardware, operating system, and application software reliability techniques that need to be implemented to achieve different levels of reliability and comparable degrees of data consistency. We describe these reliability levels in terms of fault detection, fault recovery, volatile data consistency, and persistent data consistency, and develop a Markov reliability model to capture these fault detection and recovery activities. We also demonstrate how this cost-effective fault tolerant technique can provide quantitative reliability improvement within applications using clustered architectures.